On Sun, 4 Mar 2012 21:09:30 -0500, Mark Miller <markrmil...@gmail.com>
wrote:
On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote:
everything stalls after it lists all segment files and a ZK
state change has occurred.
Can you get a stack trace here? I'll try to respond to more tomorrow.
What version of trunk are you using? We have been making fixes and
improvements all the time, so need to get a frame of reference.
I updated trunk this Saturday, March 3rd. The stack traces I provided
are all I got. This point of restart and stalling does not produce a
stack trace at all. This is the final part of the info log:
[lots of segment files]...fdt, _135.fdt, _199_nrm.cfs, _18s.tvd, _zm.fdx, _18s.tvf, _196_0.frq, _135.tvf, _195.fdt, _135.tvd, _18n.tvf, _18n.tvd, _18y_0.tim, _18s.tvx, _zm.fnm, _187.tvx, _10g.fnm, _13t.per, _195.fdx]
2012-03-04 22:39:15,061 INFO [solr.core.SolrCore] - [recoveryExecutor-2-thread-1] - : newest commit = 31
2012-03-04 22:39:16,052 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:16,585 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:36,652 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:52,220 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
When a client node cannot talk to zookeeper, it must reject updates
(searches will still work), because it may no longer know certain
things it should (what if a leader changes?). Why can't the node talk
to zookeeper? Perhaps the load on the server is so high that it cannot
respond to zk within the session timeout? I really don't know yet.
When this happens, though, it forces a recovery when/if the node can
reconnect to zookeeper.
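(As an aside, the policy Mark describes can be sketched roughly as follows. This is my own illustrative sketch, not Solr's actual code; all class and method names here are made up.)

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: while the ZooKeeper session is down the node keeps
// serving searches from its local index but rejects updates, and once the
// session is restored it flags itself for recovery, since it may have
// missed a leader change while disconnected.
class ZkAwareNode {
    private final AtomicBoolean zkConnected = new AtomicBoolean(true);
    private volatile boolean recoveryRequested = false;

    // Called from a ZooKeeper watcher on Disconnected/Expired events.
    void onZkSessionLost() {
        zkConnected.set(false);
    }

    // Called on reconnect; the node cannot trust its view of the cluster
    // any more, so it must go through recovery.
    void onZkSessionRestored() {
        zkConnected.set(true);
        recoveryRequested = true;
    }

    String handleUpdate(String doc) {
        if (!zkConnected.get()) {
            // Reject: accepting writes while blind to leader changes
            // could fork the index.
            return "503 Service Unavailable";
        }
        return "200 OK";
    }

    String handleSearch(String query) {
        // Searches are served from the local index regardless of ZK state.
        return "200 OK";
    }

    boolean isRecoveryRequested() { return recoveryRequested; }
}
```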
Sounds likely. There are a lot of timeouts in ZK's log, such as:
EndOfStreamException: Unable to read additional data from client sessionid 0x135dfcfda000012, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Thread.java:662)
2012-03-04 22:37:38,956 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /141.105.120.151:49833 which had sessionid 0x135dfcfda000012
2012-03-04 22:37:39,077 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x135dfcfda000010, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Thread.java:662)
2012-03-04 22:37:39,077 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /141.105.120.153:36794 which had sessionid 0x135dfcfda000010
2012-03-04 22:37:48,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@334] - Expiring session 0x135dfcfda000011, timeout of 10000ms exceeded
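(The "timeout of 10000ms exceeded" line matches the default 10s client session timeout, which fits the high-load theory: the Solr node simply isn't heartbeating ZK in time. One mitigation worth trying, though this is my assumption and not something confirmed in this thread, is raising zkClientTimeout in solr.xml, e.g.:)

```xml
<!-- solr.xml sketch: raise the ZooKeeper client session timeout from the
     10s default so a briefly overloaded node is not expired immediately.
     Attribute names follow the Solr example solr.xml of this era; adjust
     to your actual config. -->
<cores adminPath="/admin/cores"
       zkClientTimeout="${zkClientTimeout:30000}"
       hostPort="${jetty.port:8983}"
       hostContext="solr">
  <core name="collection1" instanceDir="." />
</cores>
```

Note this only buys headroom; if GC pauses or I/O saturation keep the client from heartbeating, the underlying load problem remains.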
The problems seem to have a lot to do with ZK, as we always see bad
messages in its log around the time Solr goes crazy.
We have not yet started on optimizing bulk indexing - currently an
update is added locally *before* sending updates in parallel to each
replica. Then we wait for each response before responding to the
client. We plan to offer more optimizations and options around this.
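(The flow Mark describes, index locally first, then fan out to each replica in parallel and wait for every ack before answering the client, can be sketched as below. This is an illustrative sketch under my own naming, not Solr's actual update code.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the distributed update path: local add first,
// parallel forwarding to replicas, then a blocking wait for all acks.
class DistributedUpdateSketch {
    interface Replica { boolean apply(String doc); }

    private final List<Replica> replicas;
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final List<String> localIndex = new ArrayList<>();

    DistributedUpdateSketch(List<Replica> replicas) {
        this.replicas = replicas;
    }

    boolean update(String doc) {
        // 1. The update is added locally *before* forwarding.
        localIndex.add(doc);

        // 2. Forward to all replicas in parallel.
        List<Future<Boolean>> acks = new ArrayList<>();
        for (Replica r : replicas) {
            acks.add(pool.submit(() -> r.apply(doc)));
        }

        // 3. Block until every replica has acknowledged; only then do we
        //    respond to the client. This per-request round trip is what
        //    makes un-optimized bulk indexing slow.
        boolean ok = true;
        for (Future<Boolean> ack : acks) {
            try {
                ok &= ack.get();
            } catch (Exception e) {
                ok = false; // a failed or interrupted replica fails the request
            }
        }
        return ok;
    }

    int localDocCount() { return localIndex.size(); }
    void shutdown() { pool.shutdown(); }
}
```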
This is indeed a bit of a problem, but at least it's indexing. If
there's any additional information you need, or you want us to pull in
new commits and try again, we're happy to give it a shot.
Feedback will be useful in making some of these improvements.
- Mark Miller
lucidimagination.com
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350