OK, I've checked in a fix to trunk. Please synch up and try again.

Karl
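
For anyone reading the archive later: the stack trace quoted further down in this thread ends in TreeSet.contains, which throws a NullPointerException when it is handed a null key and the set uses natural ordering. The following is only a minimal sketch of that failure and of one possible guard. It is not the actual CONNECTORS-637 change. The class name MimeTypeCheck and the method isIndexable are hypothetical, and treating a missing mime type as "not indexable" is an assumption about policy, not something stated in the thread.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class MimeTypeCheck {
    // Hypothetical stand-in for the connector's configured list of allowed mime types.
    private final Set<String> allowed = new TreeSet<>();

    public MimeTypeCheck(String... mimeTypes) {
        for (String m : mimeTypes) {
            allowed.add(m.toLowerCase(Locale.ROOT));
        }
    }

    // Null-safe lookup. Assumption: a document with no mime type is rejected
    // rather than indexed; the opposite policy would return true here instead.
    public boolean isIndexable(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return allowed.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        MimeTypeCheck check = new MimeTypeCheck("application/pdf", "text/plain");
        System.out.println(check.isIndexable("application/pdf"));
        System.out.println(check.isIndexable(null));

        // The unguarded lookup, for comparison: a naturally ordered TreeSet
        // rejects null keys, which matches the top frames of the quoted trace.
        try {
            new TreeSet<String>().contains(null);
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup threw NullPointerException");
        }
    }
}
```

Whether documents without a mime type should be skipped or indexed anyway is a separate decision; the sketch only shows that the lookup itself has to be guarded.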
On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddy...@gmail.com> wrote:
> The problem is that there are some documents you are indexing that
> have no mime type set at all. The ElasticSearch connector is not
> handling that case properly. I've opened ticket CONNECTORS-637, and
> will fix it shortly.
>
> Karl
>
> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>> Hi Karl,
>>
>> The extended logging has helped me find the next problem :-)
>>
>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>
>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>> java.lang.NullPointerException
>>     at java.util.TreeMap.getEntry(TreeMap.java:324)
>>     at java.util.TreeMap.containsKey(TreeMap.java:209)
>>     at java.util.TreeSet.contains(TreeSet.java:217)
>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>     at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>
>> There'll be a whole batch, then a pause, then another batch. I suspect
>> this is because MCF is retrying?
>>
>> My theory about this is that Documentum is returning the mime type as
>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>> an allowed mime type in the ElasticSearch page of the job config, just
>> to see if it would parse this ok.
>>
>> Do you know if there's any way to map from a source's content type to
>> a destination's content type?
>>
>> On 31 January 2013 23:09, Karl Wright <daddy...@gmail.com> wrote:
>>> I just chased down and fixed a problem in trunk. ElasticSearch is now
>>> returning a 201 code for successful indexing in some cases, and the
>>> connector was not handling that as 'success'.
>>>
>>> Karl
>>>
>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>> Please let me know if you see any problems. I'll fix anything you
>>>> find as quickly as I can.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>> Great, thanks, I'll give it a try.
>>>>>
>>>>> On 30 January 2013 18:52, Karl Wright <daddy...@gmail.com> wrote:
>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>> Search error reporting significantly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>> error handling. CONNECTORS-629.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>
>>>>>>>> Never knew that before.
>>>>>>>>
>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>
>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>
>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>> non-successful HTTP response.
>>>>>>>>
>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>> elasticsearch.log either...
>>>>>>>>>
>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>
>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>> being posted. The connector is seeing a non-200 HTTP response, and
>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>
>>>>>>>>>> if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>   throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>
>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>
>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>
>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>
>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>
>>>>>>>>>>>> First, which version of ManifoldCF is this? I need to know that
>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>
>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>> UI?
>>>>>>>>>>>> Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>> what?
>>>>>>>>>>>>
>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"? If that is the
>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>     at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
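
Two points raised in this thread lend themselves to a short illustration: a 201 response from ElasticSearch should count as success, and a failed request should surface the HTTP status code and response content instead of an empty exception message. The sketch below shows one way to express both. It is not the connector's real code. The class ResultCheck, its methods, and the 500-character cap are all hypothetical, and accepting the whole 2xx range (rather than only 200 and 201) is an assumption.

```java
public class ResultCheck {
    // Hypothetical cap so a very large response body does not flood the log.
    private static final int MAX_BODY_CHARS = 500;

    // Assumption: any 2xx status is treated as success, which covers the
    // 201 responses mentioned in the thread as well as plain 200.
    public static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    // Builds an exception message carrying the status code and the
    // (possibly truncated) response body, so the log explains the failure.
    public static String describeFailure(int statusCode, String responseBody) {
        String body = (responseBody == null) ? "" : responseBody.trim();
        if (body.length() > MAX_BODY_CHARS) {
            body = body.substring(0, MAX_BODY_CHARS) + "...";
        }
        return "ElasticSearch request failed: HTTP " + statusCode
            + (body.isEmpty() ? " (empty response body)" : ", response: " + body);
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(201));
        System.out.println(isSuccess(400));
        // The body below is a made-up placeholder, not real ElasticSearch output.
        System.out.println(describeFailure(400, "{\"error\":\"made-up example body\"}"));
    }
}
```

With a message built this way, a rejected request such as the uppercase index name described above would show its status code and server response in manifoldcf.log, without needing a packet capture.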