On Sun, 4 Mar 2012 21:09:30 -0500, Mark Miller <markrmil...@gmail.com> wrote:
On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote:

everything stalls after it lists all segment files and reports that a ZK state change has occurred.

Can you get a stack trace here? I'll try to respond to the rest tomorrow.
What version of trunk are you using? We have been making fixes and
improvements all the time, so I need a frame of reference.

I updated trunk this Saturday, March 3rd. The stack traces I provided are all I got. The point where it restarts and stalls does not produce a stack trace at all. This is the final part of the info log:

[lots of segment files]...fdt, _135.fdt, _199_nrm.cfs, _18s.tvd, _zm.fdx, _18s.tvf, _196_0.frq, _135.tvf, _195.fdt, _135.tvd, _18n.tvf, _18n.tvd, _18y_0.tim, _18s.tvx, _zm.fnm, _187.tvx, _10g.fnm, _13t.per, _195.fdx]
2012-03-04 22:39:15,061 INFO [solr.core.SolrCore] - [recoveryExecutor-2-thread-1] - : newest commit = 31
2012-03-04 22:39:16,052 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:16,585 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:36,652 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:52,220 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred


When a client node cannot talk to ZooKeeper, it may not know
certain things it should (what if a leader changes?), so it must reject
updates (searches will still work). Why can't the node talk to
ZooKeeper? Perhaps the load on the server is so high that it cannot
respond to ZK within the session timeout? I really don't know yet.
When this happens, though, it forces a recovery when/if the node can
reconnect to ZooKeeper.

Sounds likely. There are a lot of timeouts in ZK's log, such as:

EndOfStreamException: Unable to read additional data from client sessionid 0x135dfcfda000012, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Thread.java:662)
2012-03-04 22:37:38,956 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /141.105.120.151:49833 which had sessionid 0x135dfcfda000012
2012-03-04 22:37:39,077 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x135dfcfda000010, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Thread.java:662)
2012-03-04 22:37:39,077 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /141.105.120.153:36794 which had sessionid 0x135dfcfda000010
2012-03-04 22:37:48,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@334] - Expiring session 0x135dfcfda000011, timeout of 10000ms exceeded

The problems seem to have a lot to do with ZK, as we always see bad messages in its log around the time Solr is going crazy.
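In case it helps, the first thing we plan to try is raising the ZK session timeout on the Solr side and making sure the ZooKeeper server allows it. This is only a minimal sketch, assuming the stock example solr.xml layout; the 30 second value is just a guess:

  # zoo.cfg - the negotiated session timeout is capped by the server;
  # the default maximum is 20 * tickTime, i.e. 40s with tickTime=2000
  tickTime=2000
  maxSessionTimeout=40000

  <!-- solr.xml - raise zkClientTimeout above the 10s we are timing out at -->
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:}"
         zkClientTimeout="${zkClientTimeout:30000}">
    ...
  </cores>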


We have not yet started on optimizing bulk indexing - currently an
update is added locally *before* being sent in parallel to each
replica. Then we wait for each response before responding to the
client. We plan to offer more optimizations and options around this.


This is indeed a bit of a problem, but at least it's indexing. If there's any additional information you need, or if you want us to pull in new commits and try again, we're happy to give it a shot.
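In the meantime we may try batching more aggressively on the client side so that each distributed round trip carries many documents instead of one. A minimal SolrJ sketch only, assuming CloudSolrServer from trunk; the ZK hosts, collection name and batch size are placeholders:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
    public static void main(String[] args) throws Exception {
      // Placeholder ZK ensemble and collection name.
      CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
      server.setDefaultCollection("collection1");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "document " + i);
        batch.add(doc);

        // Send in batches of 1000 so each request (and each forward to the
        // replicas) carries many documents instead of one.
        if (batch.size() == 1000) {
          server.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
      }
      server.commit();
      server.shutdown();
    }
  }

The batch size is arbitrary; the idea is just to amortize the per-request overhead until the bulk path is optimized.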


Feedback will be useful in making some of these improvements.


- Mark Miller
lucidimagination.com

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
