Hi All,

Could this issue have something to do with the values/parameters set in
properties.xml, shown below?
[image: image.png]
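
For reference, ManifoldCF's properties.xml is just a flat list of property
name/value pairs, as in the sketch below. I'm not aware of a documented MCF
property there that caps the number of documents fetched per library; the
parameter names shown are standard MCF ones, but the values are illustrative
only.

>>>>>>
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
  <!-- Standard MCF parameters; values here are illustrative only -->
  <property name="org.apache.manifoldcf.databaseimplementationclass"
            value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
  <property name="org.apache.manifoldcf.crawler.threads" value="30"/>
</configuration>
<<<<<<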


On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia <[email protected]>
wrote:

> And what other SharePoint parameters could I check?
>
> Jorge Alonso Garcia
>
>
>
> On Fri, Dec 20, 2019 at 12:47 PM, Karl Wright (<[email protected]>)
> wrote:
>
>> The code seems correct and many people are using it without encountering
>> this problem.  There may be another SharePoint configuration parameter you
>> also need to look at somewhere.
>>
>> Karl
>>
>>
>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <[email protected]>
>> wrote:
>>
>>>
>>> Hi Karl,
>>> On SharePoint the list view threshold is 150,000, but we only receive
>>> 20,000 from MCF
>>> [image: image.png]
>>>
>>>
>>> Jorge Alonso Garcia
>>>
>>>
>>>
>>> On Thu, Dec 19, 2019 at 7:19 PM, Karl Wright (<[email protected]>)
>>> wrote:
>>>
>>>> If the job finished without error, it implies that the number of
>>>> documents returned from this one library was 10000 when the service was
>>>> called the first time (starting at doc 0), 10000 when it was called the
>>>> second time (starting at doc 10000), and zero when it was called the third
>>>> time (starting at doc 20000).
>>>>
>>>> The plugin code is unremarkable and actually gets results in chunks of
>>>> 1000 under the covers:
>>>>
>>>> >>>>>>
>>>>                         SPQuery listQuery = new SPQuery();
>>>>                         listQuery.Query = "<OrderBy
>>>> Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
>>>>                         listQuery.QueryThrottleMode =
>>>> SPQueryThrottleOption.Override;
>>>>                         listQuery.ViewAttributes =
>>>> "Scope=\"Recursive\"";
>>>>                         listQuery.ViewFields = "<FieldRef
>>>> Name='FileRef' />";
>>>>                         listQuery.RowLimit = 1000;
>>>>
>>>>                         XmlDocument doc = new XmlDocument();
>>>>                         retVal = doc.CreateElement("GetListItems",
>>>>                             "
>>>> http://schemas.microsoft.com/sharepoint/soap/directory/";);
>>>>                         XmlNode getListItemsNode =
>>>> doc.CreateElement("GetListItemsResponse");
>>>>
>>>>                         uint counter = 0;
>>>>                         do
>>>>                         {
>>>>                             if (counter >= startRowParam +
>>>> rowLimitParam)
>>>>                                 break;
>>>>
>>>>                             SPListItemCollection collListItems =
>>>> oList.GetItems(listQuery);
>>>>
>>>>
>>>>                             foreach (SPListItem oListItem in
>>>> collListItems)
>>>>                             {
>>>>                                 if (counter >= startRowParam && counter
>>>> < startRowParam + rowLimitParam)
>>>>                                 {
>>>>                                     XmlNode resultNode =
>>>> doc.CreateElement("GetListItemsResult");
>>>>                                     XmlAttribute idAttribute =
>>>> doc.CreateAttribute("FileRef");
>>>>                                     idAttribute.Value = oListItem.Url;
>>>>
>>>> resultNode.Attributes.Append(idAttribute);
>>>>                                     XmlAttribute urlAttribute =
>>>> doc.CreateAttribute("ListItemURL");
>>>>                                     //urlAttribute.Value =
>>>> oListItem.ParentList.DefaultViewUrl;
>>>>                                     urlAttribute.Value =
>>>> string.Format("{0}?ID={1}",
>>>> oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>>>> oListItem.ID);
>>>>
>>>> resultNode.Attributes.Append(urlAttribute);
>>>>
>>>> getListItemsNode.AppendChild(resultNode);
>>>>                                 }
>>>>                                 counter++;
>>>>                             }
>>>>
>>>>                             listQuery.ListItemCollectionPosition =
>>>> collListItems.ListItemCollectionPosition;
>>>>
>>>>                         } while (listQuery.ListItemCollectionPosition
>>>> != null);
>>>>
>>>>                         retVal.AppendChild(getListItemsNode);
>>>> <<<<<<
>>>>
>>>> The code is clearly working if you get 20000 results returned, so I
>>>> submit that perhaps there's a configured limit in your SharePoint instance
>>>> that prevents listing more than 20000.  That's the only way I can explain
>>>> this.
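>>>>
>>>> One concrete thing worth checking, if I remember SharePoint's throttling
>>>> rules correctly: because the plugin queries with
>>>> SPQueryThrottleOption.Override, the limit that applies is the "list view
>>>> threshold for auditors and administrators"
>>>> (MaxItemsPerThrottledOperationOverride), which defaults to 20000, not the
>>>> regular list view threshold.  A sketch for inspecting both values via the
>>>> server object model (the web application URL is a placeholder):
>>>>
>>>> >>>>>>
>>>> using System;
>>>> using Microsoft.SharePoint.Administration;
>>>>
>>>> class ThrottleCheck
>>>> {
>>>>     static void Main()
>>>>     {
>>>>         // Look up the web application; replace the URL with your own.
>>>>         SPWebApplication webApp =
>>>>             SPWebApplication.Lookup(new Uri("http://server/"));
>>>>
>>>>         // Regular list view threshold (what Central Admin shows):
>>>>         Console.WriteLine(webApp.MaxItemsPerThrottledOperation);
>>>>
>>>>         // Threshold applied when SPQueryThrottleOption.Override is used;
>>>>         // its default is 20000:
>>>>         Console.WriteLine(webApp.MaxItemsPerThrottledOperationOverride);
>>>>     }
>>>> }
>>>> <<<<<<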
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> The job finishes OK (several times), but always with these 20000
>>>>> documents; for some reason the loop only executes twice.
>>>>>
>>>>> Jorge Alonso Garcia
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 19, 2019 at 6:14 PM, Karl Wright (<[email protected]>)
>>>>> wrote:
>>>>>
>>>>>> If they are all in one library, then you'd be running this code:
>>>>>>
>>>>>> >>>>>>
>>>>>> int startingIndex = 0;
>>>>>> int amtToRequest = 10000;
>>>>>> while (true)
>>>>>> {
>>>>>>   com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>>>>     itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>>>>>>
>>>>>>   MessageElement[] itemsList = itemsResult.get_any();
>>>>>>
>>>>>>   if (Logging.connectors.isDebugEnabled()){
>>>>>>     Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>>>>   }
>>>>>>
>>>>>>   if (itemsList.length != 1)
>>>>>>     throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>>>>>>
>>>>>>   MessageElement items = itemsList[0];
>>>>>>   if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>>>>     throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>>>>
>>>>>>   int resultCount = 0;
>>>>>>   Iterator iter = items.getChildElements();
>>>>>>   while (iter.hasNext())
>>>>>>   {
>>>>>>     MessageElement child = (MessageElement)iter.next();
>>>>>>     if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>>>>     {
>>>>>>       Iterator resultIter = child.getChildElements();
>>>>>>       while (resultIter.hasNext())
>>>>>>       {
>>>>>>         MessageElement result = (MessageElement)resultIter.next();
>>>>>>         if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>>>>         {
>>>>>>           resultCount++;
>>>>>>           String relPath = result.getAttribute("FileRef");
>>>>>>           String displayURL = result.getAttribute("ListItemURL");
>>>>>>           fileStream.addFile( relPath, displayURL );
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   if (resultCount < amtToRequest)
>>>>>>     break;
>>>>>>
>>>>>>   startingIndex += resultCount;
>>>>>> }
>>>>>> <<<<<<
>>>>>>
>>>>>> What this does is request library content URLs in chunks of 10000.
>>>>>> It stops when it receives fewer than 10000 documents from any one request.
>>>>>>
>>>>>> If the documents were all in one library, then one call to the web
>>>>>> service yielded 10000 documents, the second call yielded 10000
>>>>>> documents, and there was no third call, for reasons I cannot figure out.
>>>>>> Since 10000 documents were returned each time, the loop ought to just
>>>>>> continue, unless there was some kind of error.  Does the job succeed, or
>>>>>> does it abort?
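>>>>>>
>>>>>> Note that the loop above logs each web service response when connector
>>>>>> debug logging is enabled.  One way to see exactly how many getListItems
>>>>>> calls are made is to raise the connector log level in properties.xml,
>>>>>> e.g. (a sketch; this is the standard MCF logging property):
>>>>>>
>>>>>> >>>>>>
>>>>>> <!-- In properties.xml: emits the "SharePoint: getChildren xml
>>>>>>      response" debug lines from the code above -->
>>>>>> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>>>>> <<<<<<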
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> If you are using the MCF plugin, and selecting the appropriate
>>>>>>> version of Sharepoint in the connection configuration, there is no hard
>>>>>>> limit I'm aware of for any Sharepoint job.  We have lots of other people
>>>>>>> using SharePoint and nobody has reported this ever before.
>>>>>>>
>>>>>>> If your SharePoint connection says "SharePoint 2003" as the
>>>>>>> SharePoint version, then sure, that would be expected behavior.  So
>>>>>>> please check that first.
>>>>>>>
>>>>>>> The other question I have is about your description of first getting
>>>>>>> 10001 documents and then later 20002.  That's not how ManifoldCF works.
>>>>>>> At the start of the crawl, seeds are added; this would start out being
>>>>>>> just the root, and then other documents would be discovered as the crawl
>>>>>>> proceeded, after subsites and libraries are discovered.  So I am still
>>>>>>> trying to square that with your description of how this is working for
>>>>>>> you.
>>>>>>>
>>>>>>> Are all of your documents in one library?  Or two libraries?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> The UI shows 20,002 documents (in a first phase it showed 10,001, and
>>>>>>>> after some further processing it rose to 20,002).
>>>>>>>> It looks like a hard limit; there are more files on SharePoint matching
>>>>>>>> the criteria used.
>>>>>>>>
>>>>>>>>
>>>>>>>> Jorge Alonso Garcia
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 4:05 PM, Karl Wright (<[email protected]>)
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Jorge,
>>>>>>>>>
>>>>>>>>> When you run the job, do you see more than 20,000 documents as
>>>>>>>>> part of it?
>>>>>>>>>
>>>>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>>>>
>>>>>>>>> Unless you are seeing a hard number like that in the UI for that
>>>>>>>>> job on the job status page, I doubt very much that the problem is a
>>>>>>>>> numerical limitation on the number of documents.  I would suspect that
>>>>>>>>> the inclusion criteria, e.g. the mime type or maximum length, are
>>>>>>>>> excluding documents.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>> We have installed the SharePoint plugin, and can properly access
>>>>>>>>>> http://server/_vti_bin/MCPermissions.asmx
>>>>>>>>>>
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>> SharePoint has more than 20,000 documents, but when the job executes
>>>>>>>>>> it only extracts these 20,000.  How can I check where the issue is?
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 19, 2019 at 12:52 PM, Karl Wright
>>>>>>>>>> (<[email protected]>) wrote:
>>>>>>>>>>
>>>>>>>>>>> By "stop at 20,000" do you mean that it finds more than 20,000
>>>>>>>>>>> but stops crawling at that time?  Or what exactly do you mean here?
>>>>>>>>>>>
>>>>>>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>>>>>>> installed the SharePoint plugin and may have selected a version of
>>>>>>>>>>> SharePoint that is inappropriate.  All SharePoint versions after 
>>>>>>>>>>> 2008 limit
>>>>>>>>>>> the number of documents returned using the standard web services 
>>>>>>>>>>> methods.
>>>>>>>>>>> The plugin allows us to bypass that hard limit.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>>>>> There is a job that crawls a SharePoint 2016 instance, but it is
>>>>>>>>>>>> not retrieving all files; it stops at 20,000 documents without any
>>>>>>>>>>>> error.
>>>>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>>>>> limitation?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>
>>>>>>>>>>>>
