Hi,
We changed the timeout on the SharePoint IIS site and now the process is able to crawl all documents. Thanks for your help.
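(For reference, a minimal sketch of the kind of web.config change the article linked below describes, assuming the parameter in question is the ASP.NET request execution timeout of the SharePoint web application. The attribute values are illustrative; executionTimeout is in seconds, so 600 = 10 minutes.)

    <!-- Sketch only: web.config in the SharePoint web application's IIS
         virtual directory. Keep whatever other attributes your install
         already sets on httpRuntime. -->
    <configuration>
      <system.web>
        <httpRuntime executionTimeout="600" maxRequestLength="51200" />
      </system.web>
    </configuration>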
On Mon, Dec 30, 2019 at 12:18, Gaurav G (<goyalgaur...@gmail.com>) wrote:

We faced a similar issue, wherein our repository had 100,000 documents but our crawler stopped after 50,000. The issue turned out to be that the query fired by the SharePoint web service gets progressively slower, and eventually the connection starts timing out before the next 10,000 records get returned. We increased a timeout parameter on SharePoint to 10 minutes, and after that we were able to crawl all documents successfully. I believe we increased the parameter indicated in the link below:

https://weblogs.asp.net/jeffwids/how-to-increase-the-timeout-for-a-sharepoint-2010-website

On Fri, Dec 20, 2019 at 6:27 PM Karl Wright <daddy...@gmail.com> wrote:

Hi Priya,

This has nothing to do with anything in ManifoldCF.

Karl

On Fri, Dec 20, 2019 at 7:56 AM Priya Arora <pr...@smartshore.nl> wrote:

Hi All,

Does this issue have anything to do with the values/parameters below, set in properties.xml?
[image: image.png]

On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia <jalon...@gmail.com> wrote:

And what other SharePoint parameter could I check?

Jorge Alonso Garcia

On Fri, Dec 20, 2019 at 12:47, Karl Wright (<daddy...@gmail.com>) wrote:

The code seems correct, and many people are using it without encountering this problem. There may be another SharePoint configuration parameter you also need to look at somewhere.

Karl

On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <jalon...@gmail.com> wrote:

Hi Karl,
On SharePoint the list view threshold is 150,000, but we only receive 20,000 from MCF.
[image: image.png]

Jorge Alonso Garcia

On Thu, Dec 19, 2019 at 19:19, Karl Wright (<daddy...@gmail.com>) wrote:

If the job finished without error, it implies that the number of documents returned from this one library was 10,000 when the service was called the first time (starting at doc 0), 10,000 when it was called the second time (starting at doc 10,000), and zero when it was called the third time (starting at doc 20,000).
The plugin code is unremarkable and actually gets results in chunks of 1000 under the covers:

>>>>>>
SPQuery listQuery = new SPQuery();
listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
listQuery.ViewAttributes = "Scope=\"Recursive\"";
listQuery.ViewFields = "<FieldRef Name='FileRef' />";
listQuery.RowLimit = 1000;

XmlDocument doc = new XmlDocument();
retVal = doc.CreateElement("GetListItems",
    "http://schemas.microsoft.com/sharepoint/soap/directory/");
XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");

uint counter = 0;
do
{
    if (counter >= startRowParam + rowLimitParam)
        break;

    SPListItemCollection collListItems = oList.GetItems(listQuery);

    foreach (SPListItem oListItem in collListItems)
    {
        if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
        {
            XmlNode resultNode = doc.CreateElement("GetListItemsResult");
            XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
            idAttribute.Value = oListItem.Url;
            resultNode.Attributes.Append(idAttribute);
            XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
            //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
            urlAttribute.Value = string.Format("{0}?ID={1}",
                oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
                oListItem.ID);
            resultNode.Attributes.Append(urlAttribute);
            getListItemsNode.AppendChild(resultNode);
        }
        counter++;
    }

    listQuery.ListItemCollectionPosition =
        collListItems.ListItemCollectionPosition;

} while (listQuery.ListItemCollectionPosition != null);

retVal.AppendChild(getListItemsNode);
<<<<<<

The code is clearly working if you get 20,000 results returned, so I submit that perhaps there's a configured limit in your SharePoint instance that prevents listing more than 20,000. That's the only way I can explain this.

Karl

On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <jalon...@gmail.com> wrote:

Hi,
The job finishes OK (several times now), but always with these 20,000 documents; for some reason the loop only executes twice.

Jorge Alonso Garcia
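(A sketch of one plausible way to square "the loop only executes twice" with the loop Karl quotes below: if the third server-side query times out and surfaces as an empty result rather than an error, the stop condition is indistinguishable from the true end of the library. fetchChunk here is a hypothetical stand-in for the itemCall.getListItems(...) plumbing, not connector source.)

    public class PagingSketch {
      // Hypothetical: the library has more items, but the third query
      // times out server-side and comes back empty.
      static int fetchChunk(int startingIndex, int amtToRequest) {
        if (startingIndex >= 20000) return 0; // timed-out call looks empty
        return amtToRequest;
      }

      public static void main(String[] args) {
        int startingIndex = 0;
        final int amtToRequest = 10000;
        while (true) {
          int resultCount = fetchChunk(startingIndex, amtToRequest);
          startingIndex += resultCount;
          // A short (or empty) chunk reads as "end of library",
          // so the job finishes cleanly at 20,000 documents.
          if (resultCount < amtToRequest)
            break;
        }
        System.out.println("Crawled " + startingIndex + " documents");
      }
    }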
On Thu, Dec 19, 2019 at 18:14, Karl Wright (<daddy...@gmail.com>) wrote:

If they are all in one library, then you'd be running this code:

>>>>>>
int startingIndex = 0;
int amtToRequest = 10000;
while (true)
{
  com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
    itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));

  MessageElement[] itemsList = itemsResult.get_any();

  if (Logging.connectors.isDebugEnabled()){
    Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
  }

  if (itemsList.length != 1)
    throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));

  MessageElement items = itemsList[0];
  if (!items.getElementName().getLocalName().equals("GetListItems"))
    throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");

  int resultCount = 0;
  Iterator iter = items.getChildElements();
  while (iter.hasNext())
  {
    MessageElement child = (MessageElement)iter.next();
    if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
    {
      Iterator resultIter = child.getChildElements();
      while (resultIter.hasNext())
      {
        MessageElement result = (MessageElement)resultIter.next();
        if (result.getElementName().getLocalName().equals("GetListItemsResult"))
        {
          resultCount++;
          String relPath = result.getAttribute("FileRef");
          String displayURL = result.getAttribute("ListItemURL");
          fileStream.addFile( relPath, displayURL );
        }
      }
    }
  }

  if (resultCount < amtToRequest)
    break;

  startingIndex += resultCount;
}
<<<<<<

What this does is request library content URLs in chunks of 10,000. It stops when it receives fewer than 10,000 documents from any one request.

If the documents were all in one library, then one call to the web service yielded 10,000 documents, the second call yielded 10,000 documents, and there was no third call, for no reason I can figure out. Since 10,000 documents were returned each time, the loop ought to just continue, unless there was some kind of error. Does the job succeed, or does it abort?

Karl

On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddy...@gmail.com> wrote:

If you are using the MCF plugin, and selecting the appropriate version of SharePoint in the connection configuration, there is no hard limit I'm aware of for any SharePoint job. We have lots of other people using SharePoint and nobody has ever reported this before.

If your SharePoint connection says "SharePoint 2003" as the SharePoint version, then sure, that would be expected behavior.
So please check that first.

The other question I have is about your description of first getting 10,001 documents and then later 20,002. That's not how ManifoldCF works. At the start of the crawl, seeds are added; this would start out being just the root, and then other documents would be discovered as the crawl proceeded, after subsites and libraries are discovered. So I am still trying to square that with your description of how this is working for you.

Are all of your documents in one library? Or two libraries?

Karl

On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <jalon...@gmail.com> wrote:

Hi,
The UI shows 20,002 documents (in a first phase it showed 10,001, and after some more processing it rose to 20,002). It looks like a hard limit; there are more files on SharePoint matching the criteria used.

Jorge Alonso Garcia

On Thu, Dec 19, 2019 at 16:05, Karl Wright (<daddy...@gmail.com>) wrote:

Hi Jorge,

When you run the job, do you see more than 20,000 documents as part of it?

Do you see *exactly* 20,000 documents as part of it?

Unless you are seeing a hard number like that in the UI for that job on the job status page, I doubt very much that the problem is a numerical limitation on the number of documents. I would suspect that the inclusion criteria, e.g. the mime type or maximum length, are excluding documents.

Karl

On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <jalon...@gmail.com> wrote:

Hi Karl,
We have installed the SharePoint plugin, and can access http://server/_vti_bin/MCPermissions.asmx properly.

[image: image.png]

SharePoint has more than 20,000 documents, but when the job executes it only extracts these 20,000. How can I check where the issue is?

Regards

Jorge Alonso Garcia

On Thu, Dec 19, 2019 at 12:52, Karl Wright (<daddy...@gmail.com>) wrote:

By "stop at 20,000" do you mean that it finds more than 20,000 but stops crawling at that point? Or what exactly do you mean here?

FWIW, the behavior you describe sounds like you may not have installed the SharePoint plugin and may have selected a version of SharePoint that is inappropriate. All SharePoint versions after 2008 limit the number of documents returned using the standard web services methods. The plugin allows us to bypass that hard limit.
Karl

On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <jalon...@gmail.com> wrote:

Hi,
We have an issue with the SharePoint connector.
There is a job that crawls a SharePoint 2016 instance, but it is not retrieving all files; it stops at 20,000 documents without any error.
Is there any parameter that should be changed to avoid this limitation?

Regards
Jorge Alonso Garcia