Re: Diagnosing REJECTED documents in job history

2013-02-04 Thread Andrew Clegg
Sadly, I did a completely fresh build, with a new database, and I
still get REJECTED for all the documents found, with no log messages.

I also tried upgrading my DFC jars to those from Documentum 6.7 as one
of my colleagues pointed out that we use 6.6, which doesn't officially
support IDfSysObject.getContentType(). It turns out that this method
returns the content type correctly if you use the 6.7 jars, even if
(like us) your Documentum installation is only 6.6 -- we verified this
with a quick Java test.
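
For the record, the quick test was nothing more elaborate than the following,
run with the 6.7 DFC jars on the classpath (docbase name, credentials and object
id are placeholders here, and the code we actually ran may have differed slightly):

    // Minimal DFC check: fetch one object and print what getContentType() returns.
    import com.documentum.com.DfClientX;
    import com.documentum.fc.client.*;
    import com.documentum.fc.common.*;

    public class ContentTypeCheck
    {
      public static void main(String[] args) throws Exception
      {
        IDfClientX clientx = new DfClientX();
        IDfSessionManager mgr = clientx.getLocalClient().newSessionManager();
        IDfLoginInfo login = clientx.getLoginInfo();
        login.setUser("someuser");          // placeholder
        login.setPassword("somepassword");  // placeholder
        mgr.setIdentity("somedocbase", login);
        IDfSession session = mgr.getSession("somedocbase");
        try
        {
          IDfSysObject obj = (IDfSysObject)session.getObject(new DfId("0902620580069898"));
          System.out.println("content type: " + obj.getContentType());
        }
        finally
        {
          mgr.release(session);
        }
      }
    }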

However, this doesn't seem to make a difference to our ManifoldCF problem.

I'm pretty stumped -- I think I might have to fire up ManifoldCF in a
debug JVM and set some breakpoints.
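
For reference, that should just mean adding the standard JDWP option to whatever
JVM runs the agents process (exactly where it goes depends on how you launch
ManifoldCF) and then attaching a remote debugger on the chosen port, e.g.:

    -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000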


On 2 February 2013 18:14, Karl Wright daddy...@gmail.com wrote:
 On Sat, Feb 2, 2013 at 10:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Thanks Karl -- I'll do a new build on Monday and go through all the
 setup again from scratch to make sure I haven't left anything out.

 Pretty sure I'm running against DFC as it wouldn't be able to get a
 list of documents otherwise, presumably?


 If you had an existing, already-crawled job, it's possible that substituting
 the stub afterwards might do something funky like this.  Just checking...

 Karl

 On 1 February 2013 18:03, Karl Wright daddy...@gmail.com wrote:
 I changed the ElasticSearch connector yet again, so that if it sees a
 null content type, it interprets it as application/unknown.  At
 least then you can make some progress until you can figure out why
 there is no content type coming out of documentum.
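
 In other words, the connector now effectively does the following before its
 mime-type check (a sketch of the behavior only, not necessarily the literal
 commit):

   if (mimeType == null)
     mimeType = "application/unknown";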

 Karl


 On Fri, Feb 1, 2013 at 12:44 PM, Karl Wright daddy...@gmail.com wrote:
 Are you sure that, after you updated, you are running the Documentum
 connector server process against DFC, and not with the ManifoldCF
 build stubs?

 The code in the connector is pretty simple; it just uses the
 getContentType() method from the IDfSysObject that represents the
 document.  That should be darned near foolproof.
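
 Roughly speaking, it boils down to this (simplified, not the literal DCTM.java
 code):

   IDfSysObject object = (IDfSysObject)session.getObject(new DfId(documentIdentifier));
   String contentType = object.getContentType();   // the dm_format name, e.g. "pdf"
   long fileLength = object.getContentSize();
   if (activities.checkLengthIndexable(fileLength) &&
     activities.checkMimeTypeIndexable(contentType))
   {
     // ... fetch the content and hand it to activities.ingestDocument() ...
   }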

 Karl


 On Fri, Feb 1, 2013 at 12:30 PM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 We have something called DAM instead of Webtop -- Digital Asset
 Manager I think? (Not a Documentum expert...)

 In DAM they show as format "pdf" but it doesn't explicitly say what
 mimetype they are. I will escalate this to our Documentum support
 people, in case it isn't sending a mimetype.

 On 1 February 2013 16:02, Karl Wright daddy...@gmail.com wrote:
 You can't significantly change the behavior of the documentum
 connector by simply changing the configuration of the elastic search
 output connector.  Did anything else change that would account for the
 missing mime types?  Do you see the mime types when you look at the
 documents in Webtop?

 Karl

 On Fri, Feb 1, 2013 at 10:57 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Now I'm back to seeing all the documents showing as REJECTED at the
 fetch stage in the job history. There's nothing in the logs to say why
 though.

 I guess this means it's Documentum's fault for sending docs without
 mime types then?

 Thanks again for all your help!

 On 1 February 2013 15:14, Karl Wright daddy...@gmail.com wrote:
 OK, I've checked in a fix to trunk.

 Please synch up and try again.
 Karl

 On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright daddy...@gmail.com 
 wrote:
 The problem is that there are some documents you are indexing that
 have no mime type set at all.  The ElasticSearch connector is not
 handling that case properly.  I've opened ticket CONNECTORS-637, and
 will fix it shortly.

 Karl

 On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Hi Karl,

 The extended logging has helped me find the next problem :-)

 Now I'm seeing hundreds of exceptions like this in the manifold log:


 FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: 
 null
 java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.containsKey(TreeMap.java:209)
 at java.util.TreeSet.contains(TreeSet.java:217)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)


 There'll be a whole batch, then a pause, then another batch. I suspect
 this is because MCF is retrying?

Re: Diagnosing REJECTED documents in job history

2013-02-01 Thread Karl Wright
The problem is that there are some documents you are indexing that
have no mime type set at all.  The ElasticSearch connector is not
handling that case properly.  I've opened ticket CONNECTORS-637, and
will fix it shortly.
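
Concretely, it needs a null guard in ElasticSearchSpecs.checkMimeType(), roughly
along these lines (a sketch only -- see the ticket for the actual change, and the
set name below is made up):

    public boolean checkMimeType(String mimeType)
    {
      // TreeSet.contains(null) throws NullPointerException, which is the stack
      // trace you are seeing; treat a missing mime type as not indexable.
      if (mimeType == null)
        return false;
      return mimeTypeSet.contains(mimeType);
    }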

Karl

On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Hi Karl,

 The extended logging has helped me find the next problem :-)

 Now I'm seeing hundreds of exceptions like this in the manifold log:


 FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
 java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.containsKey(TreeMap.java:209)
 at java.util.TreeSet.contains(TreeSet.java:217)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)


 There'll be a whole batch, then a pause, then another batch. I suspect
 this is because MCF is retrying?

 My theory about this is that Documentum is returning the mime type as
 just "pdf" instead of "application/pdf" -- although I did add "pdf" as
 an allowed mime type in the ElasticSearch page of the job config, just
 to see if it would parse this ok.

 Do you know if there's any way to map from a source's content type to
 a destination's content type?
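
 What I'm imagining is something as simple as this, applied to the value before
 the mime-type check -- purely hypothetical, I don't know whether MCF has a hook
 for it:

   // Hypothetical mapping from Documentum format names to real MIME types.
   Map<String,String> formatToMime = new HashMap<String,String>();
   formatToMime.put("pdf", "application/pdf");
   formatToMime.put("pdftext", "application/pdf");
   String mimeType = formatToMime.containsKey(contentType)
     ? formatToMime.get(contentType) : contentType;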



 On 31 January 2013 23:09, Karl Wright daddy...@gmail.com wrote:
 I just chased down and fixed a problem in trunk.  ElasticSearch is now
 returning a 201 code for successful indexing in some cases, and the
 connector was not handling that as 'success'.

 Karl


 On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright daddy...@gmail.com wrote:
 Please let me know if you see any problems.  I'll fix anything you
 find as quickly as I can.

 Karl

 On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Great, thanks, I'll give it a try.

 On 30 January 2013 18:52, Karl Wright daddy...@gmail.com wrote:
 I just checked in a refactoring to trunk that should improve Elastic
 Search error reporting significantly.

 Karl


 On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote:
 I agree that the Elastic Search connector needs far better logging and
 error handling.  CONNECTORS-629.

 Karl

 On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg 
 andrew.cl...@gmail.com wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

Re: Diagnosing REJECTED documents in job history

2013-02-01 Thread Karl Wright
OK, I've checked in a fix to trunk.

Please synch up and try again.
Karl

On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright daddy...@gmail.com wrote:
 The problem is that there are some documents you are indexing that
 have no mime type set at all.  The ElasticSearch connector is not
 handling that case properly.  I've opened ticket CONNECTORS-637, and
 will fix it shortly.

 Karl

 On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Hi Karl,

 The extended logging has helped me find the next problem :-)

 Now I'm seeing hundreds of exceptions like this in the manifold log:


 FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
 java.lang.NullPointerException
 at java.util.TreeMap.getEntry(TreeMap.java:324)
 at java.util.TreeMap.containsKey(TreeMap.java:209)
 at java.util.TreeSet.contains(TreeSet.java:217)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)


 There'll be a whole batch, then a pause, then another batch. I suspect
 this is because MCF is retrying?

 My theory about this is that Documentum is returning the mime type as
 just pdf instead of application/pdf -- although I did add pdf as
 an allowed mime type in the ElasticSearch page of the job config, just
 to see if it would parse this ok.

 Do you know if there's any way to map from a source's content type to
 a destination's content type?



 On 31 January 2013 23:09, Karl Wright daddy...@gmail.com wrote:
 I just chased down and fixed a problem in trunk.  ElasticSearch is now
 returning a 201 code for successful indexing in some cases, and the
 connector was not handling that as 'success'.

 Karl


 On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright daddy...@gmail.com wrote:
 Please let me know if you see any problems.  I'll fix anything you
 find as quickly as I can.

 Karl

 On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Great, thanks, I'll give it a try.

 On 30 January 2013 18:52, Karl Wright daddy...@gmail.com wrote:
 I just checked in a refactoring to trunk that should improve Elastic
 Search error reporting significantly.

 Karl


 On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote:
 I agree that the Elastic Search connector needs far better logging and
 error handling.  CONNECTORS-629.

 Karl

 On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg 
 andrew.cl...@gmail.com wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

Re: Diagnosing REJECTED documents in job history

2013-01-31 Thread Andrew Clegg
Great, thanks, I'll give it a try.

On 30 January 2013 18:52, Karl Wright daddy...@gmail.com wrote:
 I just checked in a refactoring to trunk that should improve Elastic
 Search error reporting significantly.

 Karl


 On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote:
 I agree that the Elastic Search connector needs far better logging and
 error handling.  CONNECTORS-629.

 Karl

 On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.




Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
Hi Karl,

I finally had a chance to go back to this and here's what I found.

Documentum was returning "pdf" and "pdftext" for the content type, not
a full mime type, so as an experiment I added these to the list of
allowed mime types in the ElasticSearch configuration for the job.

This time, it got slightly further -- the corresponding documents now
show as Success instead of REJECTED in the job history.

However, they don't show up in ElasticSearch, and there's nothing in
the ES logs or console to indicate that ManifoldCF ever even tried to
connect. It's like it's just dropping them and declaring the job a
success.

On the other hand, there are lots of messages like this in the MCF log:

WARN 2013-01-30 13:08:16,431 (Worker thread '12') - Pre-ingest service
interruption reported for job 1358442009776 connection 'Documentum
RoW': Job no longer active

Any idea if this could be related?


On 21 January 2013 12:29, Karl Wright daddy...@gmail.com wrote:
 Logging output is a function of each connector, and unfortunately the
 documentum connector has pretty limited logging.

 The extension exclusions are unlikely to be in play because the
 Documentum connector does not use them.  So it would be only mime type
 and length.  You should be able to check both of these properties of
 specific documents you are missing in the Documentum Webtop UI.

 Karl

   @Override
   public boolean checkLengthIndexable(String outputDescription, long length)
   throws ManifoldCFException, ServiceInterruption
   {
 ElasticSearchSpecs specs = getSpecsCache(outputDescription);
 long maxFileSize = specs.getMaxFileSize();
  if (length > maxFileSize)
   return false;
 return super.checkLengthIndexable(outputDescription, length);
   }

   @Override
   public boolean checkDocumentIndexable(String outputDescription, File
 localFile)
   throws ManifoldCFException, ServiceInterruption
   {
 ElasticSearchSpecs specs = getSpecsCache(outputDescription);
 return specs
 .checkExtension(FilenameUtils.getExtension(localFile.getName()));
   }

   @Override
   public boolean checkMimeTypeIndexable(String outputDescription,
   String mimeType) throws ManifoldCFException, ServiceInterruption
   {
 ElasticSearchSpecs specs = getSpecsCache(outputDescription);
 return specs.checkMimeType(mimeType);
   }


 On Mon, Jan 21, 2013 at 6:50 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Just to clarify that last post, I haven't disabled any of the allowed
 mime types for ES, so as long as they're not something really weird it
 should be fine.

 Unless it's a file extension problem (ES also has allowed file
 extensions) but is there a way to get that level of information about
 each document out of MCF?

 Can you enable verbose logging somehow to see what type, size and
 extension each processed document was?

 On 21 January 2013 11:47, Andrew Clegg andrew.cl...@gmail.com wrote:
 So, the only content types in Documentum are pdf and pdftext.

 application/pdf is enabled in the ES tab in the job config. (I
 assume they both map to application/pdf -- how would I check for
 sure?)

 And my max file size is 16777216000 which is way bigger than any of
 the rejected documents.

 Sadly it's still rejecting them all.


 On 21 January 2013 11:33, Andrew Clegg andrew.cl...@gmail.com wrote:
 Close, it's ElasticSearch. Okay, I'll play around with these, thanks.

 On 21 January 2013 11:26, Karl Wright daddy...@gmail.com wrote:
 Hi Andrew,

 The reason for rejection has to do with the criteria you provide for
 the job.  Specifically:

   if (activities.checkLengthIndexable(fileLength) &&
 activities.checkMimeTypeIndexable(contentType))
   {
 ...

 These are provided by your output connection; in there you may specify
 what mime types and what file length cutoff you want.  From the fact
 that you get these, I am guessing it's a Solr connection.  These
 criteria typically show up on tabs for the job definition.

 Karl

 On Mon, Jan 21, 2013 at 4:52 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Hi,

 I'm trying to set up a fairly simple crawl where I pull documents from
 Documentum and push them into ElasticSearch, using the 1.0.1 binary
 release with all appropriate extras for Documentum added.

 The repository connection looks fine -- in the job config I can see
 the paths, document types, content types etc. as expected.

 Also the ES output connection looks fine, it reports connection 
 working.

 However, when I do a crawl, every document it attempts to ingest shows
 this in the job history:

 01-18-2013 17:36:24.279 fetch 0902620580069898 REJECTED 6264431

 (date, time, activity, identifier, result code, bytes, time)

 How can I go about diagnosing what's causing this?

 I can't see anything suspect in the ManifoldCF stdout or log, and
 there's nothing in the Documentum server process or registry process
 output or logs either.

 Any ideas how I'd go about diagnosing this?

 The Documentum server is on a remote machine administered by a
 different team, that I don't have direct access to, so any tips for
 things I could try at my end before escalating it to them would be
 particularly useful.

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
Ok, so let's back up a bit.

First, which version of ManifoldCF is this?  I need to know that
before I can interpret the stack trace.

Second, what do you see when you view the connection in the crawler
UI?  Does it say Connection working, or something else, and if so,
what?

I've created a ticket for better error reporting in this connector -
it was a contribution and AFAIK the error handling is not very robust
at this point, but I can fix that quickly with your help. ;-)

Karl

On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
Thanks for all your help Karl!

It's 1.0.1 from the binary distro.

And yes, it says Connection working when I view it.

On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
That information isn't being recorded in manifoldcf.log unfortunately
-- I included all that was there. And there are no exceptions in
elasticsearch.log either...

I'll try running wireshark to see if I can follow the TCP stream.



On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
Nailed it with the help of wireshark! Turns out it was my fault -- I
had set it up to use (i.e. create) an index called DocumentumRoW but
it turns out ES index names must be all lowercase.

Never knew that before.

Slightly annoyed that ES didn't log that...

Thanks again for your help Karl :-)

My only request on the MCF front would be that it would be nice for
the output connector to log the actual status code and content of a
non-successful HTTP response.
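
Something along these lines in the connector would have pointed straight at the
problem -- a sketch only, assuming "method" is the commons-httpclient HttpMethod
that getStatusCode() suggests, and glossing over the IOException that
getResponseBodyAsString() can throw:

    if (!checkResultCode(method.getStatusCode()))
      throw new ManifoldCFException("ElasticSearch request failed: HTTP "
        + method.getStatusCode() + ": " + method.getResponseBodyAsString());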


On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
I agree that the Elastic Search connector needs far better logging and
error handling.  CONNECTORS-629.

Karl

On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
I just checked in a refactoring to trunk that should improve Elastic
Search error reporting significantly.

Karl


On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote:
 I agree that the Elastic Search connector needs far better logging and
 error handling.  CONNECTORS-629.

 Karl

 On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Nailed it with the help of wireshark! Turns out it was my fault -- I
 had set it up to use (i.e. create) an index called DocumentumRoW but
 it turns out ES index names must be all lowercase.

 Never knew that before.

 Slightly annoyed that ES didn't log that...

 Thanks again for your help Karl :-)

 My only request on the MCF front would be that it would be nice for
 the output connector to log the actual status code and content of a
 non-successful HTTP response.


 On 30 January 2013 14:21, Andrew Clegg andrew.cl...@gmail.com wrote:
 That information isn't being recorded in manifoldcf.log unfortunately
 -- I included all that was there. And there are no exceptions in
 elasticsearch.log either...

 I'll try running wireshark to see if I can follow the TCP stream.



 On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:
 Ok, ElasticSearch is not happy about something when the document is
 being posted.  The connector is seeing a non-200 HTTP response, and
 throwing an exception as a result:

   if (!checkResultCode(method.getStatusCode()))
 throw new ManifoldCFException(getResultDescription());

 Presumably the exception message in the log tells us what that HTTP
 code is, but you did not include that key info.

 Karl

 On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Thanks for all your help Karl!

 It's 1.0.1 from the binary distro.

 And yes, it says Connection working when I view it.

 On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote:
 Ok, so let's back up a bit.

 First, which version of ManifoldCF is this?  I need to know that
 before I can interpret the stack trace.

 Second, what do you see when you view the connection in the crawler
 UI?  Does it say Connection working, or something else, and if so,
 what?

 I've created a ticket for better error reporting in this connector -
 it was a contribution and AFAIK the error handling is not very robust
 at this point, but I can fix that quickly with your help. ;-)

 Karl

 On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 On 30 January 2013 13:33, Karl Wright daddy...@gmail.com wrote:

 So you saw events in the history which correspond to these documents
 and which are of type Indexation that say success?  If that is the
 case, then the ElasticSearch connector thinks it handed the documents
 successfully to the ElasticSearch server.

 Ah, no, the activity is fetch rather than indexation. e.g.

 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361

 I don't see any history entries relating to indexing as a specific
 activity in its own right. Sorry, that was probably a red herring, I
 don't think it's getting that far.

 I just noticed that above all the service interruption reported
 warnings are some errors like this:

 ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException:
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.init(ElasticSearchIndex.java:138)
 at 
 org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
 at 
 org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
 at 
 org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
 at 
 org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at 
 org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)

 Sadly there's no description, just a stacktrace.

 I know the ES server is visible from the MCF server -- actually
 they're the same machine, and it's configured to use
 http://127.0.0.1:9200/ as the server URL. And I can go to the command
 line on that server and curl that URL successfully.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

Re: Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
Close, it's ElasticSearch. Okay, I'll play around with these, thanks.

On 21 January 2013 11:26, Karl Wright daddy...@gmail.com wrote:
 Hi Andrew,

 The reason for rejection has to do with the criteria you provide for
 the job.  Specifically:

   if (activities.checkLengthIndexable(fileLength) &&
 activities.checkMimeTypeIndexable(contentType))
   {
 ...

 These are provided by your output connection; in there you may specify
 what mime types and what file length cutoff you want.  From the fact
 that you get these, I am guessing it's a Solr connection.  These
 criteria typically show up on tabs for the job definition.

 Karl

 On Mon, Jan 21, 2013 at 4:52 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Hi,

 I'm trying to set up a fairly simple crawl where I pull documents from
 Documentum and push them into ElasticSearch, using the 1.0.1 binary
 release with all appropriate extras for Documentum added.

 The repository connection looks fine -- in the job config I can see
 the paths, document types, content types etc. as expected.

 Also the ES output connection looks fine, it reports connection working.

 However, when I do a crawl, every document it attempts to ingest shows
 this in the job history:

 01-18-2013 17:36:24.279 fetch 0902620580069898 REJECTED 6264431

 (date, time, activity, identifier, result code, bytes, time)

 How can I go about diagnosing what's causing this?

 I can't see anything suspect in the ManifoldCF stdout or log, and
 there's nothing in the Documentum server process or registry process
 output or logs either.

 Any ideas how I'd go about diagnosing this?

 The Documentum server is on a remote machine administered by a
 different team, that I don't have direct access to, so any tips for
 things I could try at my end before escalating it to them would be
 particularly useful.

 Thanks,

 Andrew.



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
So, the only content types in Documentum are "pdf" and "pdftext".

"application/pdf" is enabled in the ES tab in the job config. (I
assume they both map to "application/pdf" -- how would I check for
sure?)
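
I suppose one way to check, given DFC/DQL access, would be to look at the
dm_format objects directly -- something like this, assuming the standard
mime_type attribute is populated on those formats ("session" being an open
IDfSession as usual):

    IDfQuery q = new DfQuery();
    q.setDQL("select name, mime_type from dm_format where name in ('pdf','pdftext')");
    IDfCollection rows = q.execute(session, IDfQuery.DF_READ_QUERY);
    try
    {
      while (rows.next())
        System.out.println(rows.getString("name") + " -> " + rows.getString("mime_type"));
    }
    finally
    {
      rows.close();
    }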

And my max file size is 16777216000 which is way bigger than any of
the rejected documents.

Sadly it's still rejecting them all.


On 21 January 2013 11:33, Andrew Clegg andrew.cl...@gmail.com wrote:
 Close, it's ElasticSearch. Okay, I'll play around with these, thanks.

 On 21 January 2013 11:26, Karl Wright daddy...@gmail.com wrote:
 Hi Andrew,

 The reason for rejection has to do with the criteria you provide for
 the job.  Specifically:

   if (activities.checkLengthIndexable(fileLength) &&
 activities.checkMimeTypeIndexable(contentType))
   {
 ...

 These are provided by your output connection; in there you may specify
 what mime types and what file length cutoff you want.  From the fact
 that you get these, I am guessing it's a Solr connection.  These
 criteria typically show up on tabs for the job definition.

 Karl

 On Mon, Jan 21, 2013 at 4:52 AM, Andrew Clegg andrew.cl...@gmail.com wrote:
 Hi,

 I'm trying to set up a fairly simple crawl where I pull documents from
 Documentum and push them into ElasticSearch, using the 1.0.1 binary
 release with all appropriate extras for Documentum added.

 The repository connection looks fine -- in the job config I can see
 the paths, document types, content types etc. as expected.

 Also the ES output connection looks fine, it reports connection working.

 However, when I do a crawl, every document it attempts to ingest shows
 this in the job history:

 01-18-2013 17:36:24.279 fetch 0902620580069898 REJECTED 6264431

 (date, time, activity, identifier, result code, bytes, time)

 How can I go about diagnosing what's causing this?

 I can't see anything suspect in the ManifoldCF stdout or log, and
 there's nothing in the Documentum server process or registry process
 output or logs either.

 Any ideas how I'd go about diagnosing this?

 The Documentum server is on a remote machine administered by a
 different team, that I don't have direct access to, so any tips for
 things I could try at my end before escalating it to them would be
 particularly useful.

 Thanks,

 Andrew.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg


Re: Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
Just to clarify that last post, I haven't disabled any of the allowed
mime types for ES, so as long as they're not something really weird it
should be fine.

Unless it's a file extension problem (ES also has allowed file
extensions) but is there a way to get that level of information about
each document out of MCF?

Can you enable verbose logging somehow to see what type, size and
extension each processed document was?

On 21 January 2013 11:47, Andrew Clegg andrew.cl...@gmail.com wrote:
 So, the only content types in Documentum are pdf and pdftext.

 application/pdf is enabled in the ES tab in the job config. (I
 assume they both map to application/pdf -- how would I check for
 sure?)

 And my max file size is 16777216000 which is way bigger than any of
 the rejected documents.

 Sadly it's still rejecting them all.


 On 21 January 2013 11:33, Andrew Clegg andrew.cl...@gmail.com wrote:
 Close, it's ElasticSearch. Okay, I'll play around with these, thanks.

 On 21 January 2013 11:26, Karl Wright daddy...@gmail.com wrote:
 Hi Andrew,

 The reason for rejection has to do with the criteria you provide for
 the job.  Specifically:

   if (activities.checkLengthIndexable(fileLength) &&
 activities.checkMimeTypeIndexable(contentType))
   {
 ...

 These are provided by your output connection; in there you may specify
 what mime types and what file length cutoff you want.  From the fact
 that you get these, I am guessing it's a Solr connection.  These
 criteria typically show up on tabs for the job definition.

 Karl

 On Mon, Jan 21, 2013 at 4:52 AM, Andrew Clegg andrew.cl...@gmail.com 
 wrote:
 Hi,

 I'm trying to set up a fairly simple crawl where I pull documents from
 Documentum and push them into ElasticSearch, using the 1.0.1 binary
 release with all appropriate extras for Documentum added.

 The repository connection looks fine -- in the job config I can see
 the paths, document types, content types etc. as expected.

 Also the ES output connection looks fine, it reports connection working.

 However, when I do a crawl, every document it attempts to ingest shows
 this in the job history:

 01-18-2013 17:36:24.279 fetch 0902620580069898 REJECTED 6264431

 (date, time, activity, identifier, result code, bytes, time)

 How can I go about diagnosing what's causing this?

 I can't see anything suspect in the ManifoldCF stdout or log, and
 there's nothing in the Documentum server process or registry process
 output or logs either.

 Any ideas how I'd go about diagnosing this?

 The Documentum server is on a remote machine administered by a
 different team, that I don't have direct access to, so any tips for
 things I could try at my end before escalating it to them would be
 particularly useful.

 Thanks,

 Andrew.



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg