Hi Dmitry,

Are you sure these are the right logs?
- They start right in the middle of a crawl
- They are already in a broken state when they start, e.g. the kinds of
things that are being looked up are already nonsense paths

I need to see logs from the BEGINNING of a fresh crawl to see how the
nonsense paths happen.

Thanks,
Karl




On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg
<[email protected]>wrote:

> Karl,
>
> I've generated logs with details as we discussed.
>
> The job was created afresh, as before:
> Path rules:
> /* file include
> /* library include
> /* list include
> /* site include
> Metadata:
> /* include true
> The logs are attached.
> - Dmitry
>
> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <[email protected]> wrote:
>
>> "Do you think that this issue is generic with regard to any Amz instance?"
>>
>> I presume so, since you apparently didn't do anything special to set one
>> of these up.  Unfortunately, such instances are not part of the free tier,
>> so I am still constrained from setting one up for myself because of
>> household rules here.
>>
>> "For now, I assume our only workaround is to list the paths of interest
>> manually"
>>
>> Depending on what is going wrong, that may not even work.  For this to
>> happen, it looks like several SharePoint web service calls must be
>> affected, and not in a cleanly predictable way.
>>
>> "is identification and extraction of attachments supported in the SP
>> connector?"
>>
>> ManifoldCF in general leaves identification and extraction to the search
>> engine.  Solr, for instance, uses Tika for this, if so configured.  You can
>> configure your Solr output connection to include or exclude specific mime
>> types or extensions if you want to limit what is attempted.
>>
>> Karl
>>
>>
>>
>>
>>
>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <
>> [email protected]> wrote:
>>
>>> Thanks, Karl. Do you think that this issue is generic with regard to any
>>> Amz instance? I'm just wondering how easily reproducible this may be...
>>>
>>> For now, I assume our only workaround is to list the paths of interest
>>> manually, i.e. add explicit rules for each library and list.
>>>
>>> A related subject - is identification and extraction of attachments
>>> supported in the SP connector?  E.g. if I have a Word doc attached to a
>>> Task list item, would that be extracted?  So far, I see that library
>>> content gets crawled and I'm getting the list item data but am not sure
>>> what happens to the attachments.
>>>
>>>
>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> Thanks for the additional information.  It does appear that the method
>>>> that lists subsites is not working as expected under AWS, nor are a
>>>> number of other methods that supposedly just list the children of a
>>>> subsite.
>>>>
>>>> I've reopened CONNECTORS-772 to work on addressing this issue.  Please
>>>> stay tuned.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Most of the paths that get generated are listed in the attached log;
>>>>> they match what shows up in the diag report. So I'm not sure where they
>>>>> diverge; most of them just don't seem right.  There are 3 subsites rooted
>>>>> in the main site: Abcd, Defghij, Klmnopqr.  It's strange that the
>>>>> connector
>>>>> would try such paths as:
>>>>>
>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are
>>>>> multiple repetitions of the same subsite on the path, and to begin with,
>>>>> Defghij is not a subsite of Klmnopqr, so why would it try this? The ///
>>>>> at the end doesn't seem correct either, unless I'm missing something in
>>>>> how this pathing works.
>>>>>
>>>>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements --
>>>>> this looks wrong. A docname is mixed into the path, and a subsite ends
>>>>> up after a docname?...
>>>>>
>>>>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ --
>>>>> same types of issues, plus now somehow the docname got split with a
>>>>> forward slash?...
>>>>>
>>>>> There are also a bunch of StringIndexOutOfBoundsExceptions.  Perhaps
>>>>> this logic doesn't fit with the pathing we're seeing on this amz-based
>>>>> installation?
>>>>>
>>>>> I'd expect the logic to just know that root contains 3 subsites, and
>>>>> work off that. Each subsite has a specific list of libraries and lists,
>>>>> etc. It seems odd that the connector gets into this matching pattern, and
>>>>> tries what looks like thousands of variations (I aborted the execution).
>>>>>
>>>>> - Dmitry
>>>>>
>>>>>
>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> To clarify, the way you would need to analyze this is to run a crawl
>>>>>> with the wildcards as you have selected, abort if necessary after a 
>>>>>> while,
>>>>>> and then use the Document Status report to list the document identifiers
>>>>>> that had been generated.  Find a document identifier that you believe
>>>>>> represents a path that is illegal, and figure out what SOAP getChild call
>>>>>> caused the problem by returning incorrect data.  In other words, find the
>>>>>> point in the path where the path diverges from what exists into what
>>>>>> doesn't exist, and go back in the ManifoldCF logs to find the particular
>>>>>> SOAP request that led to the issue.
>>>>>>
>>>>>> I'd expect from your description that the problem lies with getting
>>>>>> child sites given a site path, but that's just a guess at this point.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Dmitry,
>>>>>>>
>>>>>>> I don't understand what you mean by "I've tried the set of wildcards
>>>>>>> as below and I seem to be running into a lot of cycles, where various
>>>>>>> subsite folders are appended to each other and an extraction of data
>>>>>>> at all of those locations is attempted".  If you are seeing cycles,
>>>>>>> it means that
>>>>>>> document discovery is still failing in some way.  For each
>>>>>>> folder/library/site/subsite, only the children of that
>>>>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>>>>
>>>>>>> If you can give a specific example, preferably including the soap
>>>>>>> back-and-forth, that would be very helpful.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>>>>> connection for crawling of all content, from the root site all the way 
>>>>>>>> down?
>>>>>>>>
>>>>>>>> I've tried the set of wildcards as below and I seem to be running
>>>>>>>> into a lot of cycles, where various subsite folders are appended to 
>>>>>>>> each
>>>>>>>> other and an extraction of data at all of those locations is attempted.
>>>>>>>> Ideally I'd like to avoid having to construct an exact set of paths 
>>>>>>>> because
>>>>>>>> the set may change, especially with new content being added.
>>>>>>>>
>>>>>>>> Path rules:
>>>>>>>> /* file include
>>>>>>>> /* library include
>>>>>>>> /* list include
>>>>>>>> /* site include
>>>>>>>>
>>>>>>>> Metadata:
>>>>>>>> /* include true
>>>>>>>>
>>>>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>>>>> hoping that some type of "/* file include" should do it, once I figure 
>>>>>>>> out
>>>>>>>> how to safely include all content.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> - Dmitry
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
