OK, I've checked in a fix to trunk.

Please sync up and try again.
Karl
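
For readers following along: the NullPointerException in the quoted trace below matches what `java.util.TreeSet.contains` does when it is handed a null key under natural ordering. The following is only a minimal sketch of a null-safe check, not the actual change checked in for CONNECTORS-637. The class and method names are hypothetical, and rejecting documents with no mime type (rather than accepting them) is an assumption.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not the real ElasticSearchSpecs class.
class MimeTypeCheck {
    /** Returns true only if mimeType is non-null and present in the allowed set. */
    static boolean isIndexable(String mimeType, Set<String> allowed) {
        if (mimeType == null) {
            // A TreeSet with natural ordering throws NullPointerException on
            // contains(null), so guard before the lookup.
            // Assumption: documents with no mime type are treated as not indexable.
            return false;
        }
        return allowed.contains(mimeType);
    }

    public static void main(String[] args) {
        Set<String> allowed = new TreeSet<>();
        allowed.add("application/pdf");
        System.out.println(isIndexable(null, allowed));
        System.out.println(isIndexable("application/pdf", allowed));
    }
}
```

Whether a missing mime type should mean "skip" or "index anyway" is a policy choice; the sketch only shows that the null case has to be decided before the set lookup.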

On Fri, Feb 1, 2013 at 10:10 AM, Karl Wright <daddy...@gmail.com> wrote:
> The problem is that there are some documents you are indexing that
> have no mime type set at all.  The ElasticSearch connector is not
> handling that case properly.  I've opened ticket CONNECTORS-637, and
> will fix it shortly.
>
> Karl
>
> On Fri, Feb 1, 2013 at 9:36 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>> Hi Karl,
>>
>> The extended logging has helped me find the next problem :-)
>>
>> Now I'm seeing hundreds of exceptions like this in the manifold log:
>>
>>
>> FATAL 2013-02-01 14:32:38,255 (Worker thread '5') - Error tossed: null
>> java.lang.NullPointerException
>>         at java.util.TreeMap.getEntry(TreeMap.java:324)
>>         at java.util.TreeMap.containsKey(TreeMap.java:209)
>>         at java.util.TreeSet.contains(TreeSet.java:217)
>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchSpecs.checkMimeType(ElasticSearchSpecs.java:164)
>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.checkMimeTypeIndexable(ElasticSearchConnector.java:333)
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.checkMimeTypeIndexable(IncrementalIngester.java:212)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMimeTypeIndexable(WorkerThread.java:2091)
>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1811)
>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:556)
>>
>>
>> There'll be a whole batch, then a pause, then another batch. I suspect
>> this is because MCF is retrying?
>>
>> My theory about this is that Documentum is returning the mime type as
>> just "pdf" instead of "application/pdf" -- although I did add "pdf" as
>> an allowed mime type in the ElasticSearch page of the job config, just
>> to see if it would parse this ok.
>>
>> Do you know if there's any way to map from a source's content type to
>> a destination's content type?
>>
>>
>>
>> On 31 January 2013 23:09, Karl Wright <daddy...@gmail.com> wrote:
>>> I just chased down and fixed a problem in trunk.  ElasticSearch is now
>>> returning a 201 code for successful indexing in some cases, and the
>>> connector was not handling that as 'success'.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>> Please let me know if you see any problems.  I'll fix anything you
>>>> find as quickly as I can.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>> Great, thanks, I'll give it a try.
>>>>>
>>>>> On 30 January 2013 18:52, Karl Wright <daddy...@gmail.com> wrote:
>>>>>> I just checked in a refactoring to trunk that should improve Elastic
>>>>>> Search error reporting significantly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>> I agree that the Elastic Search connector needs far better logging and
>>>>>>> error handling.  CONNECTORS-629.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>>>>> it turns out ES index names must be all lowercase.
>>>>>>>>
>>>>>>>> Never knew that before.
>>>>>>>>
>>>>>>>> Slightly annoyed that ES didn't log that...
>>>>>>>>
>>>>>>>> Thanks again for your help Karl :-)
>>>>>>>>
>>>>>>>> My only request on the MCF front would be that it would be nice for
>>>>>>>> the output connector to log the actual status code and content of a
>>>>>>>> non-successful HTTP response.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>>>>> elasticsearch.log either...
>>>>>>>>>
>>>>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 30 January 2013 14:16, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>>>>> throwing an exception as a result:
>>>>>>>>>>
>>>>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>>>>
>>>>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>>>>> code is, but you did not include that key info.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>>>> Thanks for all your help Karl!
>>>>>>>>>>>
>>>>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>>>>
>>>>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>>>>
>>>>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>>>>
>>>>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>>>>
>>>>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>>>>> what?
>>>>>>>>>>>>
>>>>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.cl...@gmail.com> wrote:
>>>>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

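Two points raised in this thread can be sketched together: ElasticSearch answering a successful index request with 201 rather than 200, and the request that the output connector log the status code and body of a non-successful HTTP response. The following is a rough illustration only, not the code that was checked in to trunk. The class and method names are hypothetical, and accepting exactly 200 and 201 is an assumption based on the messages above.

```java
// Hypothetical helper for illustration; not the ManifoldCF connector's actual code.
class ResultCodeCheck {
    /** Treats 200 (OK) and 201 (Created) as successful indexing responses. */
    static boolean isSuccess(int statusCode) {
        return statusCode == 200 || statusCode == 201;
    }

    /** Builds a log message that keeps both the status code and the response body. */
    static String describeFailure(int statusCode, String responseBody) {
        String body = (responseBody == null || responseBody.isEmpty())
            ? "<empty body>"
            : responseBody;
        return "ElasticSearch returned HTTP " + statusCode + ": " + body;
    }

    public static void main(String[] args) {
        int status = 400; // example value only, not taken from the thread
        if (!isSuccess(status)) {
            System.out.println(describeFailure(status, "{\"error\":\"example\"}"));
        }
    }
}
```

Carrying the code and body into the exception message would have surfaced a rejected request in manifoldcf.log directly, without needing a packet capture.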