On 2/16/2012 6:28 PM, Mark Miller wrote:
I'm not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is time out reasonably (if it takes too long to wait
for the exception) and retry.
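
For reference, the "time out and retry" idea could look roughly like the sketch below. It is only an illustration, not anything that ships with SolrJ: TimeoutRetry, call, and findTimeout are made-up names, and the cause-chain walk assumes the client may wrap the underlying SocketTimeoutException in another exception. It retries only when a socket timeout is found and rethrows anything else immediately.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of SolrJ): retry a request when it fails
// with a socket read timeout, and give up after maxAttempts tries.
final class TimeoutRetry {
    private TimeoutRetry() {}

    static <T> T call(Callable<T> request, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                SocketTimeoutException timeout = findTimeout(e);
                if (timeout == null) {
                    throw e; // not a timeout: do not mask other failures
                }
                last = timeout;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // brief pause before the next try
                }
            }
        }
        throw last; // every attempt timed out
    }

    // Walk the cause chain, on the assumption that the client library may
    // wrap the low-level timeout inside its own exception type.
    static SocketTimeoutException findTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SocketTimeoutException) {
                return (SocketTimeoutException) c;
            }
        }
        return null;
    }
}
```

The request passed in would be whatever call is timing out, for example the existence query, issued after setSoTimeout has been set on the server object so that a stalled read fails quickly enough for a retry to be worthwhile.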

When the timeout exception happens, it is happening within the same second as the beginning of the update cycle, which involves a lot of other things happening (such as talking to a database) before it even gets around to talking to Solr. I do not have millisecond timestamps, but from what little I can tell, it's a handful of milliseconds from when SolrJ starts the request until the exception is logged. It happens relatively rarely - no more than once every few days, usually less often than that. I cannot reproduce it at will. Nobody is doing any work on either Solr or the network when it happens. Nothing is logged in the Solr server log or in syslog at the OS level; the only mention of anything bad going on is in the log of my SolrJ application.
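
To get real numbers instead of guessing from one-second log timestamps, something like the following sketch could wrap the request and record how long it ran before succeeding or failing. The names TimedCall, Result, and describe are invented for this example and are not from SolrJ or my application.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: measure how long a request ran before it returned
// or threw, so the log can show elapsed milliseconds for a failure.
final class TimedCall {
    private TimedCall() {}

    static final class Result<T> {
        final T value;            // null when the request failed
        final Exception error;    // null when the request succeeded
        final long elapsedMillis;

        Result(T value, Exception error, long elapsedMillis) {
            this.value = value;
            this.error = error;
            this.elapsedMillis = elapsedMillis;
        }

        String describe() {
            return (error == null ? "succeeded after " : "failed after ")
                    + elapsedMillis + " ms"
                    + (error == null ? "" : ": " + error);
        }
    }

    static <T> Result<T> run(Callable<T> request) {
        long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
        try {
            T value = request.call();
            return new Result<>(value, null, elapsedMillis(start));
        } catch (Exception e) {
            return new Result<>(null, e, elapsedMillis(start));
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }
}
```

Logging describe() on failure would show whether the exception really arrives within a few milliseconds, which would point away from a genuine read timeout and toward the connection being dropped.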

I never had this problem when my build system was written in Perl, using LWP to make HTTP requests with URLs that I constructed myself. The Perl system ran on CentOS 5 with Xen virtualization; now I'm running CentOS 6 on the bare metal. I'm using a bonded interface (for failover, not load balancing) made up of two NICs plugged into separate switches. When it was virtualized, the Xen host was also using an identically configured bonded interface, bridged to the guests, which used eth0.

The last time the error happened, which was on Feb 15th at 2:04 PM MST, the query that failed was 'did:(289800299 OR 289800157)', a very simple query against a tlong field. The application tests for the existence of the did values that it is trying to delete before it issues the delete request.

I'm willing to look deeper into possible networking issues, but I am skeptical about that being the problem, and because there are no log messages to investigate, I have no idea how to proceed. The application runs on one of four Solr servers, and sometimes the error even happens when connecting to Solr on the same server it's running on, which takes the gigabit switches out of the equation. If it's an actual networking problem, it's either in the hardware (Dell PowerEdge 2950 III, built-in NICs) or the CentOS 6 kernel.

At this point, I am thinking it's one of the following problems, in order of decreasing probability:

1) I am using SolrJ incorrectly.
2) There is a SolrJ problem that only appears under specific circumstances that happen to exist in my setup.
3) My hardware or OS software has an extremely intermittent problem.

What other info can I provide?

Thanks,
Shawn
