PS, if it helps narrow the problem down: I removed the entry for just "pdf" from the allowed mime types for the job, leaving only the standard set of mime types that the job config is pre-populated with, and it still happens.
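The NullPointerException in the trace quoted below is raised inside TreeSet.contains, which is what a non-empty, naturally ordered TreeSet does when it is handed a null argument. That would be consistent with the mime type reaching the check as null rather than as "pdf", though the thread does not confirm this. The following is a minimal sketch of a null-safe check that also maps a short source-side name to a full MIME type. The class MimeTypeCheck, its methods, and the alias table are hypothetical illustrations, not part of ManifoldCF.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not ManifoldCF code.
class MimeTypeCheck {
    // Assumed alias table from short source-side names to full MIME types.
    private static final Map<String, String> ALIASES = new TreeMap<>();
    static {
        ALIASES.put("pdf", "application/pdf");
    }

    // Lower-cases the value and expands known aliases; null stays null.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String lower = mimeType.trim().toLowerCase(Locale.ROOT);
        String mapped = ALIASES.get(lower);
        return mapped != null ? mapped : lower;
    }

    // Null-safe membership test: a missing mime type is treated as not allowed
    // instead of being passed to TreeSet.contains, which would throw.
    static boolean isAllowed(Set<String> allowed, String mimeType) {
        String normalized = normalize(mimeType);
        return normalized != null && allowed.contains(normalized);
    }

    public static void main(String[] args) {
        Set<String> allowed = new TreeSet<>();
        allowed.add("application/pdf");
        allowed.add("text/plain");

        try {
            allowed.contains(null); // unguarded lookup with a null key
            System.out.println("unguarded contains(null) did not throw");
        } catch (NullPointerException e) {
            System.out.println("unguarded contains(null) threw NullPointerException");
        }

        System.out.println(isAllowed(allowed, null));
        System.out.println(isAllowed(allowed, "pdf"));
        System.out.println(isAllowed(allowed, "image/png"));
    }
}
```

Whether a document with no mime type should be skipped or indexed anyway is a policy choice; the sketch simply rejects it so that the lookup cannot throw.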
On 1 February 2013 14:36, Andrew Clegg <andrew.cl...@gmail.com> wrote:
> Hi Karl,
>
> The extended logging has helped me find the next problem :-)
>
> Now I'm seeing hundreds of exceptions like this in the manifold log:
>
>
> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
> java.lang.NullPointerException
>     at java.util.TreeMap.getEntry(TreeMap.java:324)
>     at java.util.TreeMap.containsKey(TreeMap.java:209)
>     at java.util.TreeSet.contains(TreeSet.java:217)
>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>     at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>
>
> There'll be a whole batch, then a pause, then another batch. I suspect
> this is because MCF is retrying?
>
> My theory about this is that Documentum is returning the mime type as
> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
> an allowed mime type in the ElasticSearch page of the job config, just
> to see if it would parse this ok.
>
> Do you know if there's any way to map from a source's content type to
> a destination's content type?
>
>
>
> On 31 January 2013 23:09, Karl Wright <daddy...@gmail.com> wrote:
>> I just chased down and fixed a problem in trunk.
>> ElasticSearch is now returning a 201 code for successful indexing in
>> some cases, and the connector was not handling that as 'success'.
>>
>> Karl
>>
>>
>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddy...@gmail.com> wrote:
>>> Please let me know if you see any problems.  I'll fix anything you
>>> find as quickly as I can.
>>>
>>> Karl
>>>
>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>> Great, thanks, I'll give it a try.
>>>>
>>>> On 30 January 2013 18:52, Karl Wright <daddy...@gmail.com> wrote:
>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>> Search error reporting significantly.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>> error handling.  CONNECTORS-629.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>
>>>>>>> Never knew that before.
>>>>>>>
>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>
>>>>>>> Thanks again for your help Karl :-)
>>>>>>>
>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>> the output connector to log the actual status code and content of a
>>>>>>> non-successful HTTP response.
>>>>>>>
>>>>>>>
>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>> elasticsearch.log either...
>>>>>>>>
>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>
>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>> throwing an exception as a result:
>>>>>>>>>
>>>>>>>>>     if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>       throw new ManifoldCFException(getResultDescription());
>>>>>>>>>
>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>
>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>
>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>
>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>
>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>
>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>> what?
>>>>>>>>>>>
>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>> at this point, but I can fix that quickly with your help.
>>>>>>>>>>> ;-)
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>
>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>
>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>
>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>
>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>
>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>
>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>> line on that server and curl that URL successfully.

--

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg