Hi Priya,

This has nothing to do with anything in ManifoldCF.

Karl


On Fri, Dec 20, 2019 at 7:56 AM Priya Arora <pr...@smartshore.nl> wrote:

> Hi All,
>
> Could this issue have something to do with the values/parameters set in
> properties.xml below?
> [image: image.png]
>
>
> On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia <jalon...@gmail.com>
> wrote:
>
>> And what other SharePoint parameters could I check?
>>
>> Jorge Alonso Garcia
>>
>>
>>
>> On Fri, Dec 20, 2019 at 12:47 PM Karl Wright (<daddy...@gmail.com>)
>> wrote:
>>
>>> The code seems correct and many people are using it without encountering
>>> this problem.  There may be another SharePoint configuration parameter you
>>> also need to look at somewhere.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia <jalon...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi Karl,
>>>> On SharePoint the list view threshold is 150,000, but we only receive
>>>> 20,000 from MCF.
>>>> [image: image.png]
>>>>
>>>>
>>>> Jorge Alonso Garcia
>>>>
>>>>
>>>>
>>>> On Thu, Dec 19, 2019 at 7:19 PM Karl Wright (<daddy...@gmail.com>)
>>>> wrote:
>>>>
>>>>> If the job finished without error it implies that the number of
>>>>> documents returned from this one library was 10000 when the service is
>>>>> called the first time (starting at doc 0), 10000 when it's called the
>>>>> second time (starting at doc 10000), and zero when it is called the third
>>>>> time (starting at doc 20000).
>>>>>
>>>>> The plugin code is unremarkable and actually gets results in chunks of
>>>>> 1000 under the covers:
>>>>>
>>>>> >>>>>>
>>>>> // Query the library recursively, ordered by FileRef, overriding throttling
>>>>> SPQuery listQuery = new SPQuery();
>>>>> listQuery.Query = "<OrderBy Override=\"TRUE\"><FieldRef Name=\"FileRef\" /></OrderBy>";
>>>>> listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
>>>>> listQuery.ViewAttributes = "Scope=\"Recursive\"";
>>>>> listQuery.ViewFields = "<FieldRef Name='FileRef' />";
>>>>> listQuery.RowLimit = 1000;
>>>>>
>>>>> XmlDocument doc = new XmlDocument();
>>>>> retVal = doc.CreateElement("GetListItems",
>>>>>     "http://schemas.microsoft.com/sharepoint/soap/directory/");
>>>>> XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");
>>>>>
>>>>> uint counter = 0;
>>>>> do
>>>>> {
>>>>>     // Stop once we have passed the window the caller asked for
>>>>>     if (counter >= startRowParam + rowLimitParam)
>>>>>         break;
>>>>>
>>>>>     // Fetch the next page of up to 1000 items
>>>>>     SPListItemCollection collListItems = oList.GetItems(listQuery);
>>>>>
>>>>>     foreach (SPListItem oListItem in collListItems)
>>>>>     {
>>>>>         // Emit only items within [startRowParam, startRowParam + rowLimitParam)
>>>>>         if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
>>>>>         {
>>>>>             XmlNode resultNode = doc.CreateElement("GetListItemsResult");
>>>>>             XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
>>>>>             idAttribute.Value = oListItem.Url;
>>>>>             resultNode.Attributes.Append(idAttribute);
>>>>>             XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
>>>>>             //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
>>>>>             urlAttribute.Value = string.Format("{0}?ID={1}",
>>>>>                 oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
>>>>>                 oListItem.ID);
>>>>>             resultNode.Attributes.Append(urlAttribute);
>>>>>             getListItemsNode.AppendChild(resultNode);
>>>>>         }
>>>>>         counter++;
>>>>>     }
>>>>>
>>>>>     // Continue from the server-side paging position
>>>>>     listQuery.ListItemCollectionPosition = collListItems.ListItemCollectionPosition;
>>>>>
>>>>> } while (listQuery.ListItemCollectionPosition != null);
>>>>>
>>>>> retVal.AppendChild(getListItemsNode);
>>>>> <<<<<<
>>>>>
>>>>> The code is clearly working if you get 20000 results returned, so I
>>>>> submit that perhaps there's a configured limit in your SharePoint instance
>>>>> that prevents listing more than 20000.  That's the only way I can explain
>>>>> this.
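>>>>>
>>>>> One thing you could check is whether your web application even permits
>>>>> the throttle override the plugin relies on.  Here is a rough, untested
>>>>> sketch using the server object model (run on the SharePoint box with
>>>>> farm-admin rights; the site URL is a placeholder):
>>>>>
>>>>> >>>>>>
>>>>> using System;
>>>>> using Microsoft.SharePoint;
>>>>> using Microsoft.SharePoint.Administration;
>>>>>
>>>>> class ThrottleCheck
>>>>> {
>>>>>     static void Main()
>>>>>     {
>>>>>         // "http://server/sites/somesite" is a placeholder site URL
>>>>>         using (SPSite site = new SPSite("http://server/sites/somesite"))
>>>>>         {
>>>>>             SPWebApplication webApp = site.WebApplication;
>>>>>             // SPQueryThrottleOption.Override (which the plugin sets) only
>>>>>             // takes effect if object-model overrides are allowed:
>>>>>             Console.WriteLine("AllowOMCodeOverrideThrottleSettings: "
>>>>>                 + webApp.AllowOMCodeOverrideThrottleSettings);
>>>>>             // The list view threshold applied to ordinary queries:
>>>>>             Console.WriteLine("MaxItemsPerThrottledOperation: "
>>>>>                 + webApp.MaxItemsPerThrottledOperation);
>>>>>         }
>>>>>     }
>>>>> }
>>>>> <<<<<<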
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <
>>>>> jalon...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> The job finishes OK (several times), but always with these 20000
>>>>>> documents; for some reason the loop only executes twice.
>>>>>>
>>>>>> Jorge Alonso Garcia
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 19, 2019 at 6:14 PM Karl Wright (<daddy...@gmail.com>)
>>>>>> wrote:
>>>>>>
>>>>>>> If they are all in one library, then you'd be running this code:
>>>>>>>
>>>>>>> >>>>>>
>>>>>>> int startingIndex = 0;
>>>>>>> int amtToRequest = 10000;
>>>>>>> while (true)
>>>>>>> {
>>>>>>>   // Ask the web service for the next chunk of up to 10000 items
>>>>>>>   com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
>>>>>>>     itemCall.getListItems(guid,Integer.toString(startingIndex),Integer.toString(amtToRequest));
>>>>>>>
>>>>>>>   MessageElement[] itemsList = itemsResult.get_any();
>>>>>>>
>>>>>>>   if (Logging.connectors.isDebugEnabled()){
>>>>>>>     Logging.connectors.debug("SharePoint: getChildren xml response: " + itemsList[0].toString());
>>>>>>>   }
>>>>>>>
>>>>>>>   if (itemsList.length != 1)
>>>>>>>     throw new ManifoldCFException("Bad response - expecting one outer 'GetListItems' node, saw "+Integer.toString(itemsList.length));
>>>>>>>
>>>>>>>   MessageElement items = itemsList[0];
>>>>>>>   if (!items.getElementName().getLocalName().equals("GetListItems"))
>>>>>>>     throw new ManifoldCFException("Bad response - outer node should have been 'GetListItems' node");
>>>>>>>
>>>>>>>   // Count the results in this chunk and hand each file off to the crawler
>>>>>>>   int resultCount = 0;
>>>>>>>   Iterator iter = items.getChildElements();
>>>>>>>   while (iter.hasNext())
>>>>>>>   {
>>>>>>>     MessageElement child = (MessageElement)iter.next();
>>>>>>>     if (child.getElementName().getLocalName().equals("GetListItemsResponse"))
>>>>>>>     {
>>>>>>>       Iterator resultIter = child.getChildElements();
>>>>>>>       while (resultIter.hasNext())
>>>>>>>       {
>>>>>>>         MessageElement result = (MessageElement)resultIter.next();
>>>>>>>         if (result.getElementName().getLocalName().equals("GetListItemsResult"))
>>>>>>>         {
>>>>>>>           resultCount++;
>>>>>>>           String relPath = result.getAttribute("FileRef");
>>>>>>>           String displayURL = result.getAttribute("ListItemURL");
>>>>>>>           fileStream.addFile( relPath, displayURL );
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>   // A short chunk means the library is exhausted; otherwise page forward
>>>>>>>   if (resultCount < amtToRequest)
>>>>>>>     break;
>>>>>>>
>>>>>>>   startingIndex += resultCount;
>>>>>>> }
>>>>>>> <<<<<<
>>>>>>>
>>>>>>> What this does is request library content URLs in chunks of 10000.
>>>>>>> It stops when it receives fewer than 10000 documents from any one
>>>>>>> request.
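>>>>>>>
>>>>>>> If you want to watch the chunking happen, connector debug logging will
>>>>>>> print the "getChildren xml response" line above for every call.  A
>>>>>>> minimal sketch of the properties.xml entry (the standard ManifoldCF
>>>>>>> connector logging switch):
>>>>>>>
>>>>>>> >>>>>>
>>>>>>> <!-- properties.xml: enable connector debug output -->
>>>>>>> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>>>>>> <<<<<<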
>>>>>>>
>>>>>>> If the documents were all in one library, then one call to the web
>>>>>>> service yielded 10000 documents, the second call yielded another 10000
>>>>>>> documents, and there was no third call, for reasons I cannot figure out.
>>>>>>> Since 10000 documents were returned each time, the loop ought to just
>>>>>>> continue unless there was some kind of error.  Does the job succeed, or
>>>>>>> does it abort?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 19, 2019 at 12:05 PM Karl Wright <daddy...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If you are using the MCF plugin, and selecting the appropriate
>>>>>>>> version of SharePoint in the connection configuration, there is no hard
>>>>>>>> limit I'm aware of for any SharePoint job.  We have lots of other people
>>>>>>>> using SharePoint, and nobody has ever reported this before.
>>>>>>>>
>>>>>>>> If your SharePoint connection says "SharePoint 2003" as the
>>>>>>>> SharePoint version, then sure, that would be expected behavior.  So 
>>>>>>>> please
>>>>>>>> check that first.
>>>>>>>>
>>>>>>>> The other question I have concerns your description of first getting
>>>>>>>> 10001 documents and then later 20002.  That's not how ManifoldCF works.  At
>>>>>>>> the start of the crawl, seeds are added; this would start out being just
>>>>>>>> the root, and then other documents would be discovered as the crawl
>>>>>>>> proceeded, after subsites and libraries are discovered.  So I am still
>>>>>>>> trying to square that with your description of how this is working for you.
>>>>>>>>
>>>>>>>> Are all of your documents in one library?  Or two libraries?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 19, 2019 at 11:42 AM Jorge Alonso Garcia <
>>>>>>>> jalon...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> The UI shows 20,002 documents (in a first phase it showed 10,001, and
>>>>>>>>> after some time of processing it rose to 20,002).
>>>>>>>>> It looks like a hard limit; there are more files on SharePoint that
>>>>>>>>> match the criteria used.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2019 at 4:05 PM Karl Wright (<daddy...@gmail.com>)
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jorge,
>>>>>>>>>>
>>>>>>>>>> When you run the job, do you see more than 20,000 documents as
>>>>>>>>>> part of it?
>>>>>>>>>>
>>>>>>>>>> Do you see *exactly* 20,000 documents as part of it?
>>>>>>>>>>
>>>>>>>>>> Unless you are seeing a hard number like that in the UI for that
>>>>>>>>>> job on the job status page, I doubt very much that the problem is a
>>>>>>>>>> numerical limitation on the number of documents.  I would suspect that
>>>>>>>>>> the inclusion criteria, e.g. the mime type or maximum length, are
>>>>>>>>>> excluding documents.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 19, 2019 at 8:51 AM Jorge Alonso Garcia <
>>>>>>>>>> jalon...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>> We have installed the SharePoint plugin, and can access
>>>>>>>>>>> http://server/_vti_bin/MCPermissions.asmx properly.
>>>>>>>>>>>
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>> SharePoint has more than 20,000 documents, but when the job executes
>>>>>>>>>>> it only extracts these 20,000.  How can I check where the issue is?
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 19, 2019 at 12:52 PM Karl Wright (<daddy...@gmail.com>)
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> By "stop at 20,000" do you mean that it finds more than 20,000
>>>>>>>>>>>> but stops crawling at that time?  Or what exactly do you mean here?
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, the behavior you describe sounds like you may not have
>>>>>>>>>>>> installed the SharePoint plugin and may have selected a version of
>>>>>>>>>>>> SharePoint that is inappropriate.  All SharePoint versions after 
>>>>>>>>>>>> 2008 limit
>>>>>>>>>>>> the number of documents returned using the standard web services 
>>>>>>>>>>>> methods.
>>>>>>>>>>>> The plugin allows us to bypass that hard limit.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 19, 2019 at 6:37 AM Jorge Alonso Garcia <
>>>>>>>>>>>> jalon...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> We have an issue with the SharePoint connector.
>>>>>>>>>>>>> There is a job that crawls a SharePoint 2016 instance, but it is
>>>>>>>>>>>>> not retrieving all files; it stops at 20,000 documents without any
>>>>>>>>>>>>> error.
>>>>>>>>>>>>> Is there any parameter that should be changed to avoid this
>>>>>>>>>>>>> limitation?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Jorge Alonso Garcia
>>>>>>>>>>>>>
>>>>>>>>>>>>>
