Karl,

This is everything that got generated, from the very beginning: I did a fresh build, created a new database and new connection definitions, then started. The log must have rolled, but the .1 log is included.
If I were to get you access to the actual test system, would you mind taking a look? It may be more efficient than sending logs.

- Dmitry

On Mon, Sep 16, 2013 at 1:48 PM, Karl Wright <[email protected]> wrote:

> These logs are different but have exactly the same problem; they start in
> the middle, when the crawl is already well underway. I'm wondering if by
> chance you have more than one agents process running or something? Or
> maybe the log is rolling and stuff is getting lost? What's there is not
> what I would expect to see, at all.
>
> I *did* manage to find two transactions that look like they might be
> helpful, but because the *results* of those transactions are required by
> transactions that take place minutes *before* in the log, I have no
> confidence that I'm looking at anything meaningful. But I'll get back to
> you on what I find nonetheless.
>
> If you decide to repeat this exercise, try watching the log with "tail -f"
> before starting the job. You should not see any log contents at all until
> the job is started.
>
> Karl
>
> On Mon, Sep 16, 2013 at 1:11 PM, Dmitry Goldenberg <[email protected]> wrote:
>
>> Karl,
>>
>> Attached please find logs which start at the beginning. I started from a
>> fresh build (clean db, etc.). The logs start at server start; then I
>> create the output connection and the repo connection, then the job, and
>> then I fire off the job. I aborted the execution a minute or so into it.
>> That's all that's in the logs, with:
>>
>> org.apache.manifoldcf.connectors=DEBUG
>> log4j.logger.httpclient.wire.header=DEBUG
>> log4j.logger.org.apache.commons.httpclient=DEBUG
>>
>> - Dmitry
>>
>> On Mon, Sep 16, 2013 at 12:39 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Are you sure these are the right logs?
>>> - They start right in the middle of a crawl
>>> - They are already in a broken state when they start, e.g. the kinds of
>>> things that are being looked up are already nonsense paths
>>>
>>> I need to see logs from the BEGINNING of a fresh crawl to see how the
>>> nonsense paths happen.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Mon, Sep 16, 2013 at 11:52 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>
>>>> Karl,
>>>>
>>>> I've generated logs with details as we discussed.
>>>>
>>>> The job was created afresh, as before:
>>>>
>>>> Path rules:
>>>> /* file include
>>>> /* library include
>>>> /* list include
>>>> /* site include
>>>>
>>>> Metadata:
>>>> /* include true
>>>>
>>>> The logs are attached.
>>>>
>>>> - Dmitry
>>>>
>>>> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> "Do you think that this issue is generic with regard to any Amz
>>>>> instance?"
>>>>>
>>>>> I presume so, since you didn't apparently do anything special to set
>>>>> one of these up. Unfortunately, such instances are not part of the free
>>>>> tier, so I am still constrained from setting one up for myself because
>>>>> of household rules here.
>>>>>
>>>>> "For now, I assume our only workaround is to list the paths of
>>>>> interest manually"
>>>>>
>>>>> Depending on what is going wrong, that may not even work. It looks
>>>>> like several SharePoint web service calls may be affected, and not in
>>>>> a cleanly predictable way.
>>>>>
>>>>> "is identification and extraction of attachments supported in the SP
>>>>> connector?"
>>>>>
>>>>> ManifoldCF in general leaves identification and extraction to the
>>>>> search engine. Solr, for instance, uses Tika for this, if so
>>>>> configured. You can configure your Solr output connection to include
>>>>> or exclude specific mime types or extensions if you want to limit what
>>>>> is attempted.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Karl.
>>>>>> Do you think that this issue is generic with regard to any Amz
>>>>>> instance? I'm just wondering how easily reproducible this may be.
>>>>>>
>>>>>> For now, I assume our only workaround is to list the paths of
>>>>>> interest manually, i.e. add explicit rules for each library and list.
>>>>>>
>>>>>> A related subject: is identification and extraction of attachments
>>>>>> supported in the SP connector? E.g. if I have a Word doc attached to
>>>>>> a Task list item, would that be extracted? So far, I see that library
>>>>>> content gets crawled and I'm getting the list item data, but I am not
>>>>>> sure what happens to the attachments.
>>>>>>
>>>>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Dmitry,
>>>>>>>
>>>>>>> Thanks for the additional information. It does appear that the
>>>>>>> method that lists subsites is not working as expected under AWS. Nor
>>>>>>> are some number of other methods which supposedly just list the
>>>>>>> children of a subsite.
>>>>>>>
>>>>>>> I've reopened CONNECTORS-772 to work on addressing this issue.
>>>>>>> Please stay tuned.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Most of the paths that get generated are listed in the attached
>>>>>>>> log; they match what shows up in the diag report. So I'm not sure
>>>>>>>> where they diverge; most of them just don't seem right. There are 3
>>>>>>>> subsites rooted in the main site: Abcd, Defghij, Klmnopqr. It's
>>>>>>>> strange that the connector would try such paths as:
>>>>>>>>
>>>>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are
>>>>>>>> multiple repetitions of the same subsite on the path, and to begin
>>>>>>>> with, Defghij is not a subsite of Klmnopqr, so why would it try this?
>>>>>>>> The /// at the end doesn't seem correct either, unless I'm missing
>>>>>>>> something in how this pathing works.
>>>>>>>>
>>>>>>>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements
>>>>>>>> -- looks wrong. A docname is mixed into the path, and a subsite ends
>>>>>>>> up after a docname?...
>>>>>>>>
>>>>>>>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ --
>>>>>>>> same types of issues, plus now somehow the docname got split with a
>>>>>>>> forward slash?..
>>>>>>>>
>>>>>>>> There are also a bunch of StringIndexOutOfBoundsExceptions. Perhaps
>>>>>>>> this logic doesn't fit with the pathing we're seeing on this
>>>>>>>> amz-based installation?
>>>>>>>>
>>>>>>>> I'd expect the logic to just know that root contains 3 subsites and
>>>>>>>> work off that. Each subsite has a specific list of libraries and
>>>>>>>> lists, etc. It seems odd that the connector gets into this matching
>>>>>>>> pattern and tries what looks like thousands of variations (I aborted
>>>>>>>> the execution).
>>>>>>>>
>>>>>>>> - Dmitry
>>>>>>>>
>>>>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Dmitry,
>>>>>>>>>
>>>>>>>>> To clarify, the way you would need to analyze this is to run a
>>>>>>>>> crawl with the wildcards as you have selected them, abort after a
>>>>>>>>> while if necessary, and then use the Document Status report to list
>>>>>>>>> the document identifiers that had been generated. Find a document
>>>>>>>>> identifier that you believe represents an illegal path, and figure
>>>>>>>>> out which SOAP getChild call caused the problem by returning
>>>>>>>>> incorrect data.
>>>>>>>>> In other words, find the point in the path where it diverges from
>>>>>>>>> what exists into what doesn't exist, and go back in the ManifoldCF
>>>>>>>>> logs to find the particular SOAP request that led to the issue.
>>>>>>>>>
>>>>>>>>> I'd expect from your description that the problem lies with
>>>>>>>>> getting child sites given a site path, but that's just a guess at
>>>>>>>>> this point.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dmitry,
>>>>>>>>>>
>>>>>>>>>> I don't understand what you mean by "I've tried the set of
>>>>>>>>>> wildcards as below and I seem to be running into a lot of cycles,
>>>>>>>>>> where various subsite folders are appended to each other and an
>>>>>>>>>> extraction of data at all of those locations is attempted". If you
>>>>>>>>>> are seeing cycles, it means that document discovery is still
>>>>>>>>>> failing in some way. For each folder/library/site/subsite, only
>>>>>>>>>> the children of that folder/library/site/subsite should be
>>>>>>>>>> appended to the path - ever.
>>>>>>>>>>
>>>>>>>>>> If you can give a specific example, preferably including the SOAP
>>>>>>>>>> back-and-forth, that would be very helpful.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>>>>>>>> connection for crawling of all content, from the root site all
>>>>>>>>>>> the way down?
>>>>>>>>>>> I've tried the set of wildcards as below, and I seem to be
>>>>>>>>>>> running into a lot of cycles, where various subsite folders are
>>>>>>>>>>> appended to each other and an extraction of data at all of those
>>>>>>>>>>> locations is attempted. Ideally I'd like to avoid having to
>>>>>>>>>>> construct an exact set of paths because the set may change,
>>>>>>>>>>> especially with new content being added.
>>>>>>>>>>>
>>>>>>>>>>> Path rules:
>>>>>>>>>>> /* file include
>>>>>>>>>>> /* library include
>>>>>>>>>>> /* list include
>>>>>>>>>>> /* site include
>>>>>>>>>>>
>>>>>>>>>>> Metadata:
>>>>>>>>>>> /* include true
>>>>>>>>>>>
>>>>>>>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>>>>>>>> hoping that some type of "/* file include" should do it, once I
>>>>>>>>>>> figure out how to safely include all content.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> - Dmitry
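A note for anyone retracing this thread: the debug categories Dmitry lists can be collected in the log4j configuration along these lines. This is a sketch, not an authoritative config; the first line in his email appears truncated, so the `log4j.logger.org.apache.manifoldcf.connectors` name shown here is an assumption inferred from the other two lines. Verify the logger names and configuration mechanism against your ManifoldCF version's documentation.

```properties
# Sketch of the debug logging used in this thread (names are assumptions;
# check them against your ManifoldCF version's logging setup).
log4j.logger.org.apache.manifoldcf.connectors=DEBUG
log4j.logger.httpclient.wire.header=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG
```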
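Karl's "tail -f" suggestion is worth spelling out. A minimal sketch, assuming a stand-in log path (adjust it to your install); the point is that the watched log should stay silent until the job is actually started:

```shell
# Start from an empty log so the next crawl's entries begin at line 1,
# then watch it live. /tmp/manifoldcf.log stands in for the real log path.
LOG=/tmp/manifoldcf.log
: > "$LOG"                          # truncate the log
tail -f "$LOG" > /tmp/watch.out &   # watch in the background
TAIL_PID=$!
sleep 1                             # let tail attach
echo "crawl started" >> "$LOG"      # stands in for the agents process logging
sleep 1                             # give tail a moment to pick the line up
kill "$TAIL_PID"
```

If entries appear in the watched output before you start the job, something else (a second agents process, or log rolling) is writing to the log, which is exactly the situation Karl suspects above.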
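Karl's debugging procedure (find the first path segment that doesn't exist, then go back to the SOAP request that produced it) can be mechanized a little. A sketch in shell: the bad identifier is taken from Dmitry's email, and `manifoldcf.log` in the comment is an assumed file name.

```shell
# Print each cumulative prefix of a suspicious document identifier.
# Check the prefixes against the real site until you find the first one
# that doesn't exist, then grep the log around the last good prefix, e.g.:
#   grep -n 'Test Library 1' manifoldcf.log
BAD='/Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements'
prefix=''
IFS='/'
for seg in $BAD; do
  [ -n "$seg" ] || continue      # skip the empty field before the leading /
  prefix="$prefix/$seg"
  echo "$prefix"
done
unset IFS
echo "$prefix" > /tmp/last_prefix.txt   # keep the full path for reference
```

For the identifier above, the first prefix that exists on the site is likely `/Test Library 1`; everything after it is where the getChild responses went wrong, so that is the neighborhood of the log to examine.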
