Re: Diagnosing REJECTED documents in job history

2013-02-04 Thread Andrew Clegg
Sadly, I did a completely fresh build, with a new database, and I still get REJECTED for all the documents found, with no log messages. I also tried upgrading my DFC jars to those from Documentum 6.7 as one of my colleagues pointed out that we use 6.6 which doesn't officially support

Re: Diagnosing REJECTED documents in job history

2013-02-01 Thread Karl Wright
The problem is that there are some documents you are indexing that have no mime type set at all. The ElasticSearch connector is not handling that case properly. I've opened ticket CONNECTORS-637, and will fix it shortly. Karl On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg andrew.cl...@gmail.com
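
For illustration, the failure mode Karl describes (a document arriving with no mime type at all) can be guarded against with an explicit check before comparing against the allowed list. This is a minimal sketch in Python, not the ElasticSearch connector's actual code or the CONNECTORS-637 fix; the function name and the `accept_missing` policy flag are assumptions made up for the example.

```python
from typing import Optional, Set


def mime_type_allowed(mime_type: Optional[str],
                      allowed: Set[str],
                      accept_missing: bool = False) -> bool:
    """Decide whether a document passes the mime type filter.

    Hypothetical helper: `accept_missing` is an assumed policy switch,
    not a ManifoldCF setting.
    """
    # Handle the "no mime type set" case explicitly instead of letting
    # a None value reach the comparison below.
    if mime_type is None or not mime_type.strip():
        return accept_missing
    # Compare case-insensitively against the (lowercase) allowed set.
    return mime_type.strip().lower() in allowed
```

Whether a document without a mime type should be accepted or rejected is a policy decision; the point of the sketch is only that the missing value is handled deliberately.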

Re: Diagnosing REJECTED documents in job history

2013-02-01 Thread Karl Wright
OK, I've checked in a fix to trunk. Please synch up and try again. Karl On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright daddy...@gmail.com wrote: The problem is that there are some documents you are indexing that have no mime type set at all. The ElasticSearch connector is not handling that

Re: Diagnosing REJECTED documents in job history

2013-01-31 Thread Andrew Clegg
Great, thanks, I'll give it a try. On 30 January 2013 18:52, Karl Wright daddy...@gmail.com wrote: I just checked in a refactoring to trunk that should improve Elastic Search error reporting significantly. Karl On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote: I agree

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
Hi Karl, I finally had a chance to go back to this and here's what I found. Documentum was returning pdf and pdftext for the content type, not a full mime type, so as an experiment I added these to the list of allowed mime types in the ElasticSearch configuration for the job. This time, it got

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
Ok, so let's back up a bit. First, which version of ManifoldCF is this? I need to know that before I can interpret the stack trace. Second, what do you see when you view the connection in the crawler UI? Does it say Connection working, or something else, and if so, what? I've created a ticket

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
Thanks for all your help Karl! It's 1.0.1 from the binary distro. And yes, it says Connection working when I view it. On 30 January 2013 14:03, Karl Wright daddy...@gmail.com wrote: Ok, so let's back up a bit. First, which version of ManifoldCF is this? I need to know that before I can

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
That information isn't being recorded in manifoldcf.log unfortunately -- I included all that was there. And there are no exceptions in elasticsearch.log either... I'll try running wireshark to see if I can follow the TCP stream. On 30 January 2013 14:16, Karl Wright daddy...@gmail.com wrote:

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Andrew Clegg
Nailed it with the help of wireshark! Turns out it was my fault -- I had set it up to use (i.e. create) an index called DocumentumRoW but it turns out ES index names must be all lowercase. Never knew that before. Slightly annoyed that ES didn't log that... Thanks again for your help Karl :-)
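
Since ES did not log the problem, a quick pre-flight check on the index name can catch it before a crawl is started. The sketch below checks only the lowercase rule mentioned in this message; the function name is hypothetical, and ES may impose further naming restrictions that are not covered here.

```python
from typing import List


def index_name_problems(name: str) -> List[str]:
    """Return a list of problems found with a proposed index name.

    Only the all-lowercase rule from this thread is checked.
    """
    problems = []
    if name != name.lower():
        problems.append(
            "index name %r contains uppercase characters; try %r"
            % (name, name.lower())
        )
    return problems


if __name__ == "__main__":
    # The index name from this thread, followed by a lowercase variant.
    for candidate in ("DocumentumRoW", "documentumrow"):
        print(candidate, index_name_problems(candidate))
```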

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
I agree that the Elastic Search connector needs far better logging and error handling. CONNECTORS-629. Karl On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Nailed it with the help of wireshark! Turns out it was my fault -- I had set it up to use (i.e. create) an

Re: Diagnosing REJECTED documents in job history

2013-01-30 Thread Karl Wright
I just checked in a refactoring to trunk that should improve Elastic Search error reporting significantly. Karl On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright daddy...@gmail.com wrote: I agree that the Elastic Search connector needs far better logging and error handling. CONNECTORS-629. Karl

Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
Hi, I'm trying to set up a fairly simple crawl where I pull documents from Documentum and push them into ElasticSearch, using the 1.0.1 binary release with all appropriate extras for Documentum added. The repository connection looks fine -- in the job config I can see the paths, document types,

Re: Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
Close, it's ElasticSearch. Okay, I'll play around with these, thanks. On 21 January 2013 11:26, Karl Wright daddy...@gmail.com wrote: Hi Andrew, The reason for rejection has to do with the criteria you provide for the job. Specifically: if

Re: Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
So, the only content types in Documentum are pdf and pdftext. application/pdf is enabled in the ES tab in the job config. (I assume they both map to application/pdf -- how would I check for sure?) And my max file size is 16777216000 which is way bigger than any of the rejected documents.

Re: Diagnosing REJECTED documents in job history

2013-01-21 Thread Andrew Clegg
Just to clarify that last post, I haven't disabled any of the allowed mime types for ES, so as long as they're not something really weird it should be fine. Unless it's a file extension problem (ES also has allowed file extensions) but is there a way to get that level of information about each
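
The three criteria discussed in these messages (allowed mime types, allowed file extensions, maximum file size) can be checked by hand for a single document to see which one, if any, would cause a rejection. This is a rough sketch under assumptions: it is not the connector's real logic, and the function name, parameter names, and the sample values in the usage lines are invented for illustration.

```python
from typing import Optional, Set


def rejection_reason(mime_type: Optional[str],
                     file_name: str,
                     size_bytes: int,
                     allowed_mime_types: Set[str],
                     allowed_extensions: Set[str],
                     max_size_bytes: int) -> Optional[str]:
    """Return a human-readable reason for rejection, or None if accepted.

    Hypothetical diagnostic helper; criteria are taken from the thread.
    """
    if size_bytes > max_size_bytes:
        return "size %d exceeds maximum %d" % (size_bytes, max_size_bytes)
    if mime_type is None or mime_type.lower() not in allowed_mime_types:
        return "mime type %r is not in the allowed list" % (mime_type,)
    # Take the text after the last dot as the extension ("" if none).
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension not in allowed_extensions:
        return "extension %r is not in the allowed list" % (extension,)
    return None


if __name__ == "__main__":
    # A short content type such as "pdf" is not a full mime type,
    # so it fails a check against "application/pdf".
    print(rejection_reason("pdf", "report.pdf", 1024,
                           {"application/pdf"}, {"pdf"}, 16777216000))
    print(rejection_reason("application/pdf", "report.pdf", 1024,
                           {"application/pdf"}, {"pdf"}, 16777216000))
```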