Karl,

This is everything that got generated, from the very beginning: I did a
fresh build with a new database and new connection definitions, then started
everything up. The log must have rolled, but the .1 log is included.
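
In case the rolling is eating history, I'll raise the appender limits on the
next run; roughly these log4j 1.x properties (the appender name and file
location here are assumed from a stock setup, ours may differ):

log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
log4j.appender.MAIN.File=logs/manifoldcf.log
log4j.appender.MAIN.MaxFileSize=50MB
log4j.appender.MAIN.MaxBackupIndex=20
log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
log4j.appender.MAIN.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n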

If I were to get you access to the actual test system, would you mind
taking a look? It may be more efficient than sending logs.

- Dmitry


On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <[email protected]> wrote:

> These logs are different but have exactly the same problem; they start in
> the middle when the crawl is already well underway.  I'm wondering if by
> chance you have more than one agents process running or something?  Or
> maybe the log is rolling and stuff is getting lost?  What's there is not
> what I would expect to see, at all.
>
> I *did* manage to find two transactions that look like they might be
> helpful, but because the *results* of those transactions are required by
> transactions that take place minutes *before* in the log, I have no
> confidence that I'm looking at anything meaningful.  But I'll get back to
> you on what I find nonetheless.
>
> If you decide to repeat this exercise, try watching the log with "tail -f"
> before starting the job.  You should not see any log contents at all until
> the job is started.
>
> Karl
>
>
> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <[email protected]> wrote:
>
>> Karl,
>>
>> Attached please find logs which start at the beginning. I started from a
>> fresh build (clean db, etc.); the logs start at server start, then I create
>> the output connection and the repo connection, then the job, and then I
>> fire off the job. I aborted the execution about a minute or so into it.
>> That's all that's in the logs, with:
>>
>> org.apache.manifoldcf.connectors=DEBUG
>>
>> log4j.logger.httpclient.wire.header=DEBUG
>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>
>> - Dmitry
>>
>>
>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Are you sure these are the right logs?
>>> - They start right in the middle of a crawl
>>> - They are already in a broken state when they start, e.g. the kinds of
>>> things that are being looked up are already nonsense paths
>>>
>>> I need to see logs from the BEGINNING of a fresh crawl to see how the
>>> nonsense paths happen.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <
>>> [email protected]> wrote:
>>>
>>>> Karl,
>>>>
>>>> I've generated logs with details as we discussed.
>>>>
>>>> The job was created afresh, as before:
>>>> Path rules:
>>>> /* file include
>>>> /* library include
>>>> /* list include
>>>> /* site include
>>>> Metadata:
>>>> /* include true
>>>> The logs are attached.
>>>> - Dmitry
>>>>
>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> "Do you think that this issue is generic with regard to any Amz
>>>>> instance?"
>>>>>
>>>>> I presume so, since you apparently didn't do anything special to set
>>>>> one of these up.  Unfortunately, such instances are not part of the free
>>>>> tier, so I am still constrained from setting one up for myself because of
>>>>> household rules here.
>>>>>
>>>>> "For now, I assume our only workaround is to list the paths of
>>>>> interest manually"
>>>>>
>>>>> Depending on what is going wrong, that may not even work.  For this to
>>>>> happen, it looks like several SharePoint web service calls must be
>>>>> affected, and not in a cleanly predictable way.
>>>>>
>>>>> "is identification and extraction of attachments supported in the SP
>>>>> connector?"
>>>>>
>>>>> ManifoldCF in general leaves identification and extraction to the
>>>>> search engine.  Solr, for instance, uses Tika for this, if so configured.
>>>>> You can configure your Solr output connection to include or exclude
>>>>> specific mime types or extensions if you want to limit what is attempted.
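>>>>>
>>>>> If you want to sanity-check what Tika would make of a given attachment
>>>>> before it ever reaches Solr, a minimal standalone sketch using the
>>>>> plain Tika facade may help (the file argument is just a placeholder):
>>>>>
>>>>> import java.io.File;
>>>>> import org.apache.tika.Tika;
>>>>>
>>>>> public class TikaCheck {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         // e.g. a Word doc saved off a SharePoint Task item
>>>>>         File f = new File(args[0]);
>>>>>         Tika tika = new Tika();
>>>>>         // Identification: sniff the mime type from name and content
>>>>>         System.out.println("Detected type: " + tika.detect(f));
>>>>>         // Extraction: pull the plain text, as Solr would via Tika
>>>>>         System.out.println(tika.parseToString(f));
>>>>>     }
>>>>> }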
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks, Karl. Do you think that this issue is generic with regard to
>>>>>> any Amz instance? I'm just wondering how easily reproducible this may
>>>>>> be.
>>>>>>
>>>>>> For now, I assume our only workaround is to list the paths of
>>>>>> interest manually, i.e. add explicit rules for each library and list.
>>>>>>
>>>>>> A related subject: is identification and extraction of attachments
>>>>>> supported in the SP connector?  E.g. if I have a Word doc attached to a
>>>>>> Task list item, would that be extracted?  So far, I see that library
>>>>>> content gets crawled and I'm getting the list item data, but I'm not
>>>>>> sure what happens to the attachments.
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Dmitry,
>>>>>>>
>>>>>>> Thanks for the additional information.  It does appear that the
>>>>>>> method that lists subsites is not working as expected under AWS, nor
>>>>>>> are a number of other methods that supposedly just list the children
>>>>>>> of a subsite.
>>>>>>>
>>>>>>> I've reopened CONNECTORS-772 to work on addressing this issue.
>>>>>>> Please stay tuned.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Most of the paths that get generated are listed in the attached
>>>>>>>> log, and they match what shows up in the diag report, so I'm not
>>>>>>>> sure where they diverge; most of them just don't seem right.  There
>>>>>>>> are 3 subsites rooted in the main site: Abcd, Defghij, Klmnopqr.
>>>>>>>> It's strange that the connector would try such paths as:
>>>>>>>>
>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are
>>>>>>>> multiple repetitions of the same subsite on the path, and to begin
>>>>>>>> with, Defghij is not a subsite of Klmnopqr, so why would it try this?
>>>>>>>> The /// at the end doesn't seem correct either, unless I'm missing
>>>>>>>> something in how this pathing works.
>>>>>>>>
>>>>>>>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements
>>>>>>>> -- looks wrong: a docname is mixed into the path, and a subsite ends
>>>>>>>> up after the docname?...
>>>>>>>>
>>>>>>>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ --
>>>>>>>> same types of issues, plus now somehow the docname got split with a
>>>>>>>> forward slash?...
>>>>>>>>
>>>>>>>> There are also a bunch of StringIndexOutOfBoundsExceptions.
>>>>>>>> Perhaps this logic doesn't fit with the pathing we're seeing on this
>>>>>>>> Amz-based installation?
>>>>>>>>
>>>>>>>> I'd expect the logic to just know that the root contains 3
>>>>>>>> subsites, and work off that. Each subsite has a specific list of
>>>>>>>> libraries and lists, etc. It seems odd that the connector gets into
>>>>>>>> this matching pattern and tries what looks like thousands of
>>>>>>>> variations (I aborted the execution).
>>>>>>>>
>>>>>>>> - Dmitry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Dmitry,
>>>>>>>>>
>>>>>>>>> To clarify, the way you would need to analyze this is to run a
>>>>>>>>> crawl with the wildcards as you have selected, abort if necessary
>>>>>>>>> after a while, and then use the Document Status report to list the
>>>>>>>>> document identifiers that had been generated.  Find a document
>>>>>>>>> identifier that you believe represents a path that is illegal, and
>>>>>>>>> figure out what SOAP getChild call caused the problem by returning
>>>>>>>>> incorrect data.  In other words, find the point in the path where
>>>>>>>>> the path diverges from what exists into what doesn't exist, and go
>>>>>>>>> back in the ManifoldCF logs to find the particular SOAP request that
>>>>>>>>> led to the issue.
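>>>>>>>>>
>>>>>>>>> To mechanize the "find where it diverges" step, something along
>>>>>>>>> these lines may help: walk the identifier segment by segment
>>>>>>>>> against the set of paths you know really exist (the known-path set
>>>>>>>>> here is hypothetical; fill it in from your actual sites and
>>>>>>>>> libraries):
>>>>>>>>>
>>>>>>>>> import java.util.Set;
>>>>>>>>>
>>>>>>>>> public class Diverge {
>>>>>>>>>     // Longest leading portion of the identifier that is a known
>>>>>>>>>     // path; the first segment past it is where bad data crept in.
>>>>>>>>>     static String lastValidPrefix(String docId, Set<String> known) {
>>>>>>>>>         String prefix = "", lastGood = "";
>>>>>>>>>         for (String seg : docId.split("/")) {
>>>>>>>>>             if (seg.isEmpty()) continue;
>>>>>>>>>             prefix = prefix + "/" + seg;
>>>>>>>>>             if (!known.contains(prefix)) break;
>>>>>>>>>             lastGood = prefix;
>>>>>>>>>         }
>>>>>>>>>         return lastGood;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     public static void main(String[] args) {
>>>>>>>>>         Set<String> known = Set.of("/Abcd", "/Abcd/Announcements");
>>>>>>>>>         System.out.println(
>>>>>>>>>             lastValidPrefix("/Abcd/Defghij/Announcements", known));
>>>>>>>>>         // prints /Abcd: the child listing at /Abcd returned junk
>>>>>>>>>     }
>>>>>>>>> }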
>>>>>>>>>
>>>>>>>>> I'd expect from your description that the problem lies with
>>>>>>>>> getting child sites given a site path, but that's just a guess at this
>>>>>>>>> point.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>
>>>>>>>>>> I don't understand what you mean by "I've tried the set of
>>>>>>>>>> wildcards as below and I seem to be running into a lot of cycles,
>>>>>>>>>> where various subsite folders are appended to each other and an
>>>>>>>>>> extraction of data at all of those locations is attempted".  If you
>>>>>>>>>> are seeing cycles, it means that document discovery is still
>>>>>>>>>> failing in some way.  For each folder/library/site/subsite, only
>>>>>>>>>> the children of that folder/library/site/subsite should be appended
>>>>>>>>>> to the path - ever.
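>>>>>>>>>>
>>>>>>>>>> As a toy sketch of that invariant (the names are made up, this is
>>>>>>>>>> not the connector's actual code), discovery should amount to a
>>>>>>>>>> strict tree walk where each step appends one direct-child segment:
>>>>>>>>>>
>>>>>>>>>> import java.util.List;
>>>>>>>>>>
>>>>>>>>>> public class TreeWalk {
>>>>>>>>>>     record Node(String name, List<Node> children) {}
>>>>>>>>>>
>>>>>>>>>>     // Appends one direct child to the parent's path, so a subsite
>>>>>>>>>>     // name can never repeat or jump branches within a path.
>>>>>>>>>>     static void discover(Node n, String parentPath) {
>>>>>>>>>>         String path = parentPath + "/" + n.name();
>>>>>>>>>>         System.out.println(path);
>>>>>>>>>>         for (Node child : n.children()) discover(child, path);
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     public static void main(String[] args) {
>>>>>>>>>>         List<Node> subsites = List.of(
>>>>>>>>>>             new Node("Abcd",
>>>>>>>>>>                 List.of(new Node("Announcements", List.of()))),
>>>>>>>>>>             new Node("Defghij", List.of()),
>>>>>>>>>>             new Node("Klmnopqr", List.of()));
>>>>>>>>>>         // Prints each real path exactly once, no repetition
>>>>>>>>>>         for (Node s : subsites) discover(s, "");
>>>>>>>>>>     }
>>>>>>>>>> }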
>>>>>>>>>>
>>>>>>>>>> If you can give a specific example, preferably including the SOAP
>>>>>>>>>> back-and-forth, that would be very helpful.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> Quick question: is there an easy way to configure an SP repo
>>>>>>>>>>> connection to crawl all content, from the root site all the way
>>>>>>>>>>> down?
>>>>>>>>>>>
>>>>>>>>>>> I've tried the set of wildcards below, and I seem to be running
>>>>>>>>>>> into a lot of cycles, where various subsite folders are appended
>>>>>>>>>>> to each other and an extraction of data at all of those locations
>>>>>>>>>>> is attempted. Ideally I'd like to avoid having to construct an
>>>>>>>>>>> exact set of paths, because the set may change, especially with
>>>>>>>>>>> new content being added.
>>>>>>>>>>>
>>>>>>>>>>> Path rules:
>>>>>>>>>>> /* file include
>>>>>>>>>>> /* library include
>>>>>>>>>>> /* list include
>>>>>>>>>>> /* site include
>>>>>>>>>>>
>>>>>>>>>>> Metadata:
>>>>>>>>>>> /* include true
>>>>>>>>>>>
>>>>>>>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>>>>>>>> hoping that some type of "/* file include" should do it, once I
>>>>>>>>>>> figure out how to safely include all content.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> - Dmitry
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
