Erick - thank you very much for the reply.  I'm still working through restarting the nodes one by one.

I'm stopping the nodes with the script, but yes - they are being killed forcefully because they are stuck in this recover, fail, retry loop.  I could increase the timeout, but they never seem to recover.

The largest tlog file that I see currently is 222 MBytes. autoCommit is set to 1800000 ms and autoSoftCommit to 120000 ms. Errors when they are in the long recovery are things like:

2017-11-20 21:41:29.755 ERROR (recoveryExecutor-3-thread-4-processing-n:frodo:9100_solr x:UNCLASS_shard37_replica1 s:shard37 c:UNCLASS r:core_node196) [c:UNCLASS s:shard37 r:core_node196 x:UNCLASS_shard37_replica1] o.a.s.h.IndexFetcher Error closing file: _8dmn.cfs
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /solr6.6.0/UNCLASS/core_node196/data/index.20171120195705961/_8dmn.cfs could only be replicated to 0 nodes instead of minReplication (=1).  There are 39 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1716)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3385)

The complete log for one of the shards that was forcefully stopped is here:
http://lovehorsepower.com/solr.log
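
For reference, those commit settings come from the updateHandler section of our solrconfig.xml and look roughly like this (paraphrased from memory - the openSearcher=false line is my recollection, not copied from the file):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>1800000</maxTime>          <!-- hard commit every 30 minutes -->
        <openSearcher>false</openSearcher>  <!-- recollection/assumption; hard commits don't open a new searcher -->
      </autoCommit>
      <autoSoftCommit>
        <maxTime>120000</maxTime>           <!-- soft commit every 2 minutes -->
      </autoSoftCommit>
    </updateHandler>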

As to what is in the logs when it is recovering for several hours, it's many WARN messages from the DFSClient such as:

Abandoning BP-1714598269-10.2.100.220-1341346046854:blk_4366207808_1103082741732

and

Excluding datanode DatanodeInfoWithStorage[172.16.100.229:50010,DS-5985e40d-830a-44e7-a2ea-fc60bebabc30,DISK]

or from the IndexFetcher:

File _a96y.cfe did not match. expected checksum is 3502268220 and actual is checksum 2563579651. expected length is 405 and actual length is 405

Unfortunately, I'm now getting errors from some of the nodes (still working through restarting them) about ZooKeeper:

org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

Any idea what those could be?  Those shards are not coming back up.  Sorry for so many questions!

-Joe

On 11/21/2017 12:11 PM, Erick Erickson wrote:
How are you stopping Solr? Nodes should not go into recovery on
startup unless Solr was killed un-gracefully (i.e. kill -9 or the
like). If you use the bin/solr script to stop Solr and see a message
about "killing XXX forcefully" then you can lengthen out the time the
script waits for shutdown (there's a sysvar you can set, look in the
script).
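
Something like this in solr.in.sh (going from memory here, so double-check
the variable name in your copy of the script; I believe it's SOLR_STOP_WAIT,
the value is in seconds, and the default is 180):

    # Seconds bin/solr waits for a graceful shutdown before it kills
    # the process forcefully. Default is 180; 600 here is just an example.
    SOLR_STOP_WAIT=600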

Actually I'll correct myself a bit. Shards _do_ go into recovery but
it should be very short in the graceful shutdown case. Basically
shards temporarily go into recovery as part of normal processing just
long enough to see there's no recovery necessary, but that should be
measured in a few seconds.

What it sounds like from this "The shards go into recovery and start
to utilize nearly all of their network" is that your nodes go into
"full recovery" where the entire index is copied down because the
replica thinks it's "too far" out of date. That indicates something
weird about the state when the Solr nodes stopped.

wild-shot-in-the-dark question. How big are your tlogs? If you don't
hard commit very often, the tlogs can replay at startup for a very
long time.

This makes no sense to me, I'm surely missing something:

The process at this point is to start one node, find out the lock
files, wait for it to come up completely (hours), stop it, delete the
write.lock files, and restart.  Usually this second restart is faster,
but it still can take 20-60 minutes.

When you start one node it may take a few minutes for leader election
to kick in (the default is 180 seconds), but absent replication it
should just be there. Taking hours totally violates my expectations.
What does Solr _think_ it's doing? What's in the logs at that point?
And if you stop solr gracefully, there shouldn't be a problem with
write.lock.
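
By the way, that 180 seconds is the leaderVoteWait setting. If you ever
need to change it, it goes in the <solrcloud> section of solr.xml and
looks roughly like this (a sketch; the value is in milliseconds and
180000 is just the default spelled out):

    <solrcloud>
      <!-- how long a core waits for other replicas of its shard to show
           up before going ahead with leader election -->
      <int name="leaderVoteWait">180000</int>
    </solrcloud>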

You could also try increasing the timeouts, and the HDFS directory
factory has some parameters to tweak that are a mystery to me...
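
For reference, the parameters I mean hang off the HdfsDirectoryFactory in
solrconfig.xml. A sketch, roughly following the ref guide example (the
hdfs://namenode:8020/solr path is just a placeholder, and the numbers are
the documented defaults, not recommendations):

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>  <!-- placeholder URI -->
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
      <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
      <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
      <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
    </directoryFactory>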

All in all, this is behavior that I find mystifying.

Best,
Erick

On Tue, Nov 21, 2017 at 5:07 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Hi All - we have a system with 45 physical boxes running Solr 6.6.1 with
the index stored in HDFS.  The current index size is about 31 TBytes.  With
3x replication that takes up 93 TBytes of disk.  Our main collection is
split across 100 shards with 3 replicas each.  The issue that we're running
into is when restarting the solr6 cluster.  The shards go into recovery and
start to utilize nearly all of their network interfaces.  If we start too
many of the nodes at once, the shards will go into a recovery, fail, and
retry loop and never come up.  The errors are related to HDFS not responding
fast enough, along with warnings from the DFSClient.  If we stop a node when
this is happening, the script will force a stop (180-second timeout), and
upon restart we have lock files (write.lock) inside of HDFS.

The process at this point is to start one node, find out the lock files,
wait for it to come up completely (hours), stop it, delete the write.lock
files, and restart.  Usually this second restart is faster, but it still can
take 20-60 minutes.

The smaller indexes recover much faster (less than 5 minutes).  Should we
not have used so many replicas with HDFS?  Is there a better way we should
have built the solr6 cluster?

Thank you for any insight!

-Joe
