Hi Dmitry,

Are you sure these are the right logs?

- They start right in the middle of a crawl.
- They are already in a broken state when they start; e.g., the kinds of things being looked up are already nonsense paths.
I need to see logs from the BEGINNING of a fresh crawl to see how the nonsense paths happen.

Thanks,
Karl

On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <[email protected]> wrote:

Karl,

I've generated logs with details as we discussed.

The job was created afresh, as before:

Path rules:
/* file include
/* library include
/* list include
/* site include

Metadata:
/* include true

The logs are attached.

- Dmitry

On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <[email protected]> wrote:

"Do you think that this issue is generic with regard to any Amz instance?"

I presume so, since you didn't apparently do anything special to set one of these up. Unfortunately, such instances are not part of the free tier, so I am still constrained from setting one up for myself because of household rules here.

"For now, I assume our only workaround is to list the paths of interest manually"

Depending on what is going wrong, that may not even work. It looks like several SharePoint web service calls may be affected, and not in a cleanly predictable way, for this to happen.

"is identification and extraction of attachments supported in the SP connector?"

ManifoldCF in general leaves identification and extraction to the search engine. Solr, for instance, uses Tika for this, if so configured. You can configure your Solr output connection to include or exclude specific mime types or extensions if you want to limit what is attempted.

Karl
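For context on the Tika point above, here is a minimal, self-contained sketch of the detect-then-extract step that an extracting handler performs via Tika. This is illustrative only, not ManifoldCF or Solr code; the file name and the excluded mime type are made-up examples.

import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.Tika;

public class ExtractionSketch {
    // Hypothetical exclusion list, analogous to excluding mime types
    // or extensions on the Solr output connection.
    private static final Set<String> EXCLUDED =
        new HashSet<String>(Arrays.asList("application/zip"));

    public static void main(String[] args) throws Exception {
        File f = new File("example.docx"); // made-up file name
        Tika tika = new Tika();

        String mimeType = tika.detect(f);      // identification
        if (EXCLUDED.contains(mimeType)) {
            System.out.println("Skipping excluded type: " + mimeType);
            return;
        }
        String text = tika.parseToString(f);   // extraction
        System.out.println("Extracted " + text.length() + " characters");
    }
}

The same include/exclude decision can also be made on extension alone, which is cheaper but less reliable than content-based detection.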
On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <[email protected]> wrote:

Thanks, Karl. Do you think that this issue is generic with regard to any Amz instance? I'm just wondering how easily reproducible this may be...

For now, I assume our only workaround is to list the paths of interest manually, i.e. add explicit rules for each library and list.

A related subject: is identification and extraction of attachments supported in the SP connector? E.g., if I have a Word doc attached to a Task list item, would that be extracted? So far, I see that library content gets crawled and I'm getting the list item data, but am not sure what happens to the attachments.

On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:

Hi Dmitry,

Thanks for the additional information. It does appear that the method that lists subsites is not working as expected under AWS, nor are a number of other methods that supposedly just list the children of a subsite.

I've reopened CONNECTORS-772 to work on addressing this issue. Please stay tuned.

Karl

On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <[email protected]> wrote:

Hi Karl,

Most of the paths that get generated are listed in the attached log; they match what shows up in the diag report. So I'm not sure where they diverge; most of them just don't seem right. There are 3 subsites rooted in the main site: Abcd, Defghij, Klmnopqr. It's strange that the connector would try such paths as:

/*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are multiple repetitions of the same subsite on the path, and to begin with, Defghij is not a subsite of Klmnopqr, so why would it try this? The /// at the end doesn't seem correct either, unless I'm missing something in how this pathing works.

/Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements -- looks wrong. A docname is mixed into the path, and a subsite ends up after a docname?...

/Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same types of issues, plus now somehow the docname got split with a forward slash?..

There are also a bunch of StringIndexOutOfBoundsExceptions. Perhaps this logic doesn't fit with the pathing we're seeing on this Amz-based installation?

I'd expect the logic to just know that root contains 3 subsites, and work off that. Each subsite has a specific list of libraries and lists, etc. It seems odd that the connector gets into this matching pattern and tries what looks like thousands of variations (I aborted the execution).

- Dmitry

On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:

Hi Dmitry,

To clarify, the way you would need to analyze this is to run a crawl with the wildcards as you have selected, abort if necessary after a while, and then use the Document Status report to list the document identifiers that had been generated. Find a document identifier that you believe represents a path that is illegal, and figure out what SOAP getChild call caused the problem by returning incorrect data. In other words, find the point in the path where the path diverges from what exists into what doesn't exist, and go back in the ManifoldCF logs to find the particular SOAP request that led to the issue.

I'd expect from your description that the problem lies with getting child sites given a site path, but that's just a guess at this point.

Karl

On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:

Hi Dmitry,

I don't understand what you mean by "I've tried the set of wildcards as below and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted". If you are seeing cycles, it means that document discovery is still failing in some way. For each folder/library/site/subsite, only the children of that folder/library/site/subsite should ever be appended to the path.

If you can give a specific example, preferably including the SOAP back-and-forth, that would be very helpful.

Karl
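The invariant Karl states above (only the children of a node, as returned by a child-listing call against that node, are ever appended to its path) can be made concrete with a small sketch. SiteClient and getChildNames are hypothetical stand-ins for the connector's SOAP child-listing calls, not actual ManifoldCF APIs. Under this invariant, a path like /Klmnopqr/Defghij/Defghij/Announcements/// can only arise if a child listing returns bad data, e.g. full paths or empty names instead of simple child names:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the SOAP calls that list a node's children.
interface SiteClient {
    List<String> getChildNames(String parentPath);
}

public class DiscoverySketch {
    static void crawl(SiteClient client, String rootPath) {
        Deque<String> queue = new ArrayDeque<String>();
        queue.add(rootPath);
        while (!queue.isEmpty()) {
            String parent = queue.removeFirst();
            for (String child : client.getChildNames(parent)) {
                // Children must be simple names, never paths or empty
                // strings; an empty name is exactly what would produce
                // trailing runs like "///".
                if (child.isEmpty() || child.indexOf('/') >= 0) {
                    throw new IllegalStateException(
                        "Child listing for " + parent
                        + " returned a non-name: \"" + child + "\"");
                }
                queue.add(parent + "/" + child);
            }
        }
    }
}

A repeated segment such as /Defghij/Defghij/ therefore points at the child-listing response itself, consistent with Karl's guess that getting child sites for a site path returns incorrect data under AWS.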
On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <[email protected]> wrote:

Hi Karl,

Quick question: is there an easy way to configure an SP repo connection for crawling all content, from the root site all the way down?

I've tried the set of wildcards below, and I seem to be running into a lot of cycles, where various subsite folders are appended to each other and an extraction of data at all of those locations is attempted. Ideally I'd like to avoid having to construct an exact set of paths, because the set may change, especially with new content being added.

Path rules:
/* file include
/* library include
/* list include
/* site include

Metadata:
/* include true

I'd also like to pull down any files attached to list items. I'm hoping that some type of "/* file include" should do it, once I figure out how to safely include all content.

Thanks,
- Dmitry
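One footnote on the wildcard rules above: whether a single set of "/*" rules can cover the whole tree depends on how the connector interprets "*", in particular whether it may match across "/" boundaries. The toy matcher below is not ManifoldCF's rule-evaluation code; it just makes that distinction concrete, with the crossing behavior left as an explicit flag since the real semantics are defined by the connector:

import java.util.regex.Pattern;

public class RuleMatchSketch {
    // Toy wildcard matcher; NOT ManifoldCF's actual rule evaluation.
    static boolean matches(String rule, String path, boolean starCrossesSlash) {
        String star = starCrossesSlash ? ".*" : "[^/]*";
        // Quote the rule literally, re-opening the quote for each "*".
        String regex = "\\Q" + rule.replace("*", "\\E" + star + "\\Q") + "\\E";
        return Pattern.matches(regex, path);
    }

    public static void main(String[] args) {
        // If "*" crosses "/", one "/*" rule reaches the whole tree:
        System.out.println(matches("/*", "/Abcd/Announcements", true));  // true
        // If not, "/*" only reaches one level down:
        System.out.println(matches("/*", "/Abcd/Announcements", false)); // false
        System.out.println(matches("/*", "/Abcd", false));               // true
    }
}

Either way, well-behaved discovery should only ever test paths assembled from real parent/child relationships; the rules decide what is kept, not what paths get constructed.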
