I'm glad you got past this. Thanks for letting us know what the issue was.

Karl
On Mon, Jan 27, 2020 at 4:05 AM Jorge Alonso Garcia <jalon...@gmail.com> wrote:
> Hi,
> We changed the timeout on the SharePoint IIS site and now the process is
> able to crawl all documents.
> Thanks for your help
>
> On Mon, Dec 30, 2019 at 12:18, Gaurav G (<goyalgaur...@gmail.com>) wrote:
>
>> We had faced a similar issue: our repo had 100,000 documents, but our
>> crawler stopped after 50,000. The issue turned out to be that the
>> SharePoint query fired by the SharePoint web service gets progressively
>> slower, and eventually the connection starts timing out before the next
>> 10,000 records are returned. We increased a timeout parameter on
>> SharePoint to 10 minutes, and after that we were able to crawl all
>> documents successfully. I believe we increased the parameter indicated
>> in the link below:
>>
>> https://weblogs.asp.net/jeffwids/how-to-increase-the-timeout-for-a-sharepoint-2010-website
>>
>> On Fri, Dec 20, 2019 at 6:27 PM Karl Wright <daddy...@gmail.com> wrote:
>>
>>> Hi Priya,
>>>
>>> This has nothing to do with anything in ManifoldCF.
>>>
>>> Karl
>>>
>>> On Fri, Dec 20, 2019 at 7:56 AM Priya Arora <pr...@smartshore.nl> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Is this issue something to do with the values/parameters set below in
>>>> properties.xml?
>>>> [image: image.png]
>>>>
>>>> On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia <jalon...@gmail.com>
>>>> wrote:
>>>>
>>>>> And what other SharePoint parameter could I check?
>>>>>
>>>>> Jorge Alonso Garcia
>>>>>
>>>>> On Fri, Dec 20, 2019 at 12:47, Karl Wright (<daddy...@gmail.com>)
>>>>> wrote:
>>>>>
>>>>>> The code seems correct and many people are using it without
>>>>>> encountering this problem. There may be another SharePoint
>>>>>> configuration parameter you also need to look at somewhere.
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia
>>>>>> <jalon...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>> On SharePoint the list view threshold is 150,000, but we only
>>>>>>> receive 20,000 from MCF.
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>> Jorge Alonso Garcia
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 19:19, Karl Wright (<daddy...@gmail.com>)
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If the job finished without error, it implies that the number of
>>>>>>>> documents returned from this one library was 10000 when the service
>>>>>>>> is called the first time (starting at doc 0), 10000 when it's called
>>>>>>>> the second time (starting at doc 10000), and zero when it is called
>>>>>>>> the third time (starting at doc 20000).
>>>>>>>>
>>>>>>>> The plugin code is unremarkable and actually gets results in chunks
>>>>>>>> of 1000 under the covers:
>>>>>>>>
>>>>>>>> >>>>>>
>>>>>>>> SPQuery listQuery = new SPQuery();
>>>>>>>> listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
>>>>>>>> listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
>>>>>>>> listQuery.ViewAttributes = "Scope=\"Recursive\"";
>>>>>>>> listQuery.ViewFields = "<FieldRef Name='FileRef' />";
>>>>>>>> listQuery.RowLimit = 1000;
>>>>>>>>
>>>>>>>> XmlDocument doc = new XmlDocument();
>>>>>>>> retVal = doc.CreateElement("GetListItems",
>>>>>>>>     "http://schemas.microsoft.com/sharepoint/soap/directory/");
>>>>>>>> XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");
>>>>>>>>
>>>>>>>> uint counter = 0;
>>>>>>>> do
>>>>>>>> {
>>>>>>>>     if (counter >= startRowParam + rowLimitParam)
>>>>>>>>         break;
>>>>>>>>
>>>>>>>>     SPListItemCollection collListItems = oList.GetItems(listQuery);
>>>>>>>>
>>>>>>>>     foreach (SPListItem oListItem in collListItems)
>>>>>>>>     {
>>>>>>>>         if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
>>>>>>>>         {
>>>>>>>>             XmlNode resultNode = doc.CreateElement("GetListItemsResult");
>>>>>>>>             XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
>>>>>>>>             idAttribute.Value = oListItem.Url;
>>>>>>>>             resultNode.Attributes.Append(idAttribute);
>>>>>>>>             XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
>>>>>>>>             //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
>>>>>>>>             urlAttribute.Value = string.Format("{0}?ID={1}",
>>>>>>>>                 oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>>>>>>>>                 oListItem.ID);
>>>>>>>>             resultNode.Attributes.Append(urlAttribute);
>>>>>>>>             getListItemsNode.AppendChild(resultNode);
>>>>>>>>         }
>>>>>>>>         counter++;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     listQuery.ListItemCollectionPosition = collListItems.ListItemCollectionPosition;
>>>>>>>>
>>>>>>>> } while (listQuery.ListItemCollectionPosition != null);
>>>>>>>>
>>>>>>>> retVal.AppendChild(getListItemsNode);
>>>>>>>> <<<<<<
>>>>>>>>
>>>>>>>> The code is clearly working if you get 20000 results returned, so I
>>>>>>>> submit that perhaps there's a configured limit in your SharePoint
>>>>>>>> instance that prevents listing more than 20000. That's the only way
>>>>>>>> I can explain this.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia
>>>>>>>> <jalon...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> The job finishes OK (several times), but always with these 20000
>>>>>>>>> documents; for some reason the loop only executes twice.
>>>>>>>>>
>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2019 at 18:14, Karl Wright (<daddy...@gmail.com>)
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If they are all in one library, then you'd be running this code:
>>>>>>>>>>
>>>>>>>>>> >>>>>>
>>>>>>>>>> int startingIndex = 0;
>>>>>>>>>> int amtToRequest = 10000;
>>>>>>>>>> while (true)
>>>>>>>>>> {
>>>>>>>>>>   com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>>>>>>>>     itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>>>>>>>>>>
>>>>>>>>>>   MessageElement[] itemsList = itemsResult.get_any();
>>>>>>>>>>
>>>>>>>>>>   if (Logging.connectors.isDebugEnabled()){
>>>>>>>>>>     Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   if (itemsList.length != 1)
>>>>>>>>>>     throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>>>>>>>>>>
>>>>>>>>>>   MessageElement items = itemsList[0];
>>>>>>>>>>   if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>>>>>>>>     throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>>>>>>>>
>>>>>>>>>>   int resultCount = 0;
>>>>>>>>>>   Iterator iter = items.getChildElements();
>>>>>>>>>>   while (iter.hasNext())
>>>>>>>>>>   {
>>>>>>>>>>     MessageElement child = (MessageElement)iter.next();
>>>>>>>>>>     if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>>>>>>>>     {
>>>>>>>>>>       Iterator resultIter = child.getChildElements();
>>>>>>>>>>       while (resultIter.hasNext())
>>>>>>>>>>       {
>>>>>>>>>>         MessageElement result = (MessageElement)resultIter.next();
>>>>>>>>>>         if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>>>>>>>>         {
>>>>>>>>>>           resultCount++;
>>>>>>>>>>           String relPath = result.getAttribute("FileRef");
>>>>>>>>>>           String displayURL = result.getAttribute("ListItemURL");
>>>>>>>>>>           fileStream.addFile( relPath, displayURL );
>>>>>>>>>>         }
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   if (resultCount < amtToRequest)
>>>>>>>>>>     break;
>>>>>>>>>>
>>>>>>>>>>   startingIndex += resultCount;
>>>>>>>>>> }
>>>>>>>>>> <<<<<<
>>>>>>>>>>
>>>>>>>>>> What this does is request library content URLs in chunks of 10000.
>>>>>>>>>> It stops when it receives fewer than 10000 documents from any one
>>>>>>>>>> request.
>>>>>>>>>>
>>>>>>>>>> If the documents were all in one library, then one call to the web
>>>>>>>>>> service yielded 10000 documents, the second call yielded 10000
>>>>>>>>>> documents, and there was no third call, for reasons I cannot
>>>>>>>>>> figure out. Since 10000 documents were returned each time, the
>>>>>>>>>> loop ought to just continue, unless there was some kind of error.
>>>>>>>>>> Does the job succeed, or does it abort?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you are using the MCF plugin, and selecting the appropriate
>>>>>>>>>>> version of SharePoint in the connection configuration, there is
>>>>>>>>>>> no hard limit I'm aware of for any SharePoint job. We have lots
>>>>>>>>>>> of other people using SharePoint and nobody has ever reported
>>>>>>>>>>> this before.
>>>>>>>>>>>
>>>>>>>>>>> If your SharePoint connection says "SharePoint 2003" as the
>>>>>>>>>>> SharePoint version, then sure, that would be expected behavior.
>>>>>>>>>>> So please check that first.
>>>>>>>>>>>
>>>>>>>>>>> The other question I have is about your description of first
>>>>>>>>>>> getting 10001 documents and then later 20002. That's not how
>>>>>>>>>>> ManifoldCF works.
>>>>>>>>>>> At the start of the crawl, seeds are added; this would start out
>>>>>>>>>>> being just the root, and then other documents would be discovered
>>>>>>>>>>> as the crawl proceeded, after subsites and libraries are
>>>>>>>>>>> discovered. So I am still trying to square that with your
>>>>>>>>>>> description of how this is working for you.
>>>>>>>>>>>
>>>>>>>>>>> Are all of your documents in one library? Or two libraries?
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia
>>>>>>>>>>> <jalon...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> The UI shows 20,002 documents (in a first phase it showed
>>>>>>>>>>>> 10,001, and after some time of processing it rose to 20,002).
>>>>>>>>>>>> It looks like a hard limit; there are more files on SharePoint
>>>>>>>>>>>> matching the criteria used.
>>>>>>>>>>>>
>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 19, 2019 at 16:05, Karl Wright (<daddy...@gmail.com>)
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Jorge,
>>>>>>>>>>>>>
>>>>>>>>>>>>> When you run the job, do you see more than 20,000 documents as
>>>>>>>>>>>>> part of it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unless you are seeing a hard number like that in the UI for
>>>>>>>>>>>>> that job on the job status page, I doubt very much that the
>>>>>>>>>>>>> problem is a numerical limitation on the number of documents.
>>>>>>>>>>>>> I would suspect that the inclusion criteria, e.g. the mime type
>>>>>>>>>>>>> or maximum length, are excluding documents.
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia
>>>>>>>>>>>>> <jalon...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>> We have installed the SharePoint plugin, and can access
>>>>>>>>>>>>>> http:/server/_vti_bin/MCPermissions.asmx properly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SharePoint has more than 20,000 documents, but when we execute
>>>>>>>>>>>>>> the job it only extracts those 20,000. How can I check where
>>>>>>>>>>>>>> the issue is?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 12:52, Karl Wright
>>>>>>>>>>>>>> (<daddy...@gmail.com>) wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> By "stop at 20,000" do you mean that it finds more than
>>>>>>>>>>>>>>> 20,000 but stops crawling at that point? Or what exactly do
>>>>>>>>>>>>>>> you mean here?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>>>>>>>>>>> installed the SharePoint plugin, and may have selected a
>>>>>>>>>>>>>>> version of SharePoint that is inappropriate. All SharePoint
>>>>>>>>>>>>>>> versions after 2008 limit the number of documents returned
>>>>>>>>>>>>>>> using the standard web services methods. The plugin allows
>>>>>>>>>>>>>>> us to bypass that hard limit.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia
>>>>>>>>>>>>>>> <jalon...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>>>>>>>>> There is a job that crawls a SharePoint 2016 site, but it
>>>>>>>>>>>>>>>> is not retrieving all files; it stops at 20,000 documents
>>>>>>>>>>>>>>>> without any error.
>>>>>>>>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>>>>>>>>> limitation?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> Jorge Alonso Garcia
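
The timeout fix described at the top of the thread (and in the linked
weblogs.asp.net article) is an ASP.NET request-timeout change on the
SharePoint web application. A minimal sketch of that kind of change,
assuming the standard httpRuntime setting; the web.config path and the
other attributes already present vary by installation and should be
preserved:

```xml
<!-- web.config of the SharePoint web application (path varies by farm).
     executionTimeout is in seconds; 600 matches the 10 minutes that
     resolved the crawl in this thread. Merge with existing attributes
     rather than replacing the element wholesale. -->
<configuration>
  <system.web>
    <httpRuntime executionTimeout="600" />
  </system.web>
</configuration>
```

After editing web.config, an IIS application pool recycle (or iisreset) is
typically needed for the new timeout to take effect.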
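
The chunked-request loop Karl quotes mid-thread stops as soon as one call
returns fewer results than it asked for, which is why a timeout on the
third call can silently end the crawl at exactly 20,000 documents. A
self-contained sketch of that stop condition, with a hypothetical
fetchPage standing in for the real itemCall.getListItems web-service call:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedCrawl {
    // Hypothetical stand-in for the SharePoint web-service call: returns
    // up to 'limit' item URLs starting at index 'start', out of 'total'.
    static List<String> fetchPage(int start, int limit, int total) {
        List<String> page = new ArrayList<>();
        for (int i = start; i < Math.min(start + limit, total); i++) {
            page.add("/sites/docs/item" + i);
        }
        return page;
    }

    // Same pattern as the connector loop: keep requesting chunks and stop
    // on the first page that is shorter than the chunk size.
    static int crawlAll(int chunkSize, int total) {
        int startingIndex = 0;
        int crawled = 0;
        while (true) {
            List<String> page = fetchPage(startingIndex, chunkSize, total);
            crawled += page.size();
            if (page.size() < chunkSize) {
                break; // short (or empty) page: no more documents
            }
            startingIndex += page.size();
        }
        return crawled;
    }

    public static void main(String[] args) {
        // 25,000 documents take three calls: 10000 + 10000 + 5000.
        System.out.println(crawlAll(10000, 25000)); // prints 25000
        // An exact multiple of the chunk size needs one extra, empty call
        // (10000 + 10000 + 0), mirroring the zero-row third call Karl
        // describes; the loop still terminates correctly.
        System.out.println(crawlAll(10000, 20000)); // prints 20000
    }
}
```

The key point for debugging: if the server times out instead of returning
a short page, the loop never sees its stop condition and the caller sees a
truncated crawl rather than an error, which matches the behavior reported
here.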