The Accumulo batch writer will re-send mutations if a tablet server fails, or if it rejects the mutations because the tablet has moved. You don't have to do anything to recover from fail-overs and re-balancing.
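To make the re-send behaviour concrete, here is a toy model of it in plain Java. It is only a sketch: `ToyServer`, `ToyCluster` and `writeWithRetry` are made-up names for illustration, not Accumulo classes, and the real batch writer also batches, backs off and runs in background threads.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Accumulo classes. */
final class RetrySketch {

    /** A pretend tablet server that records the mutations it has applied. */
    static final class ToyServer {
        final String name;
        boolean alive = true;
        final List<String> applied = new ArrayList<>();
        ToyServer(String name) { this.name = name; }
    }

    /** Pretend cluster state: which server currently hosts each tablet. */
    static final class ToyCluster {
        final Map<String, ToyServer> hostOf = new HashMap<>();

        /** Returns true if the write was applied, false if the server is down or no longer hosts the tablet. */
        boolean tryWrite(ToyServer target, String tablet, String mutation) {
            if (!target.alive || hostOf.get(tablet) != target) {
                return false; // dead server, or the tablet has moved elsewhere
            }
            target.applied.add(mutation);
            return true;
        }
    }

    /**
     * Client-side loop: send to the cached location; on failure, look up the
     * current host and re-send. Returns the number of attempts used.
     */
    static int writeWithRetry(ToyCluster cluster, Map<String, ToyServer> cache,
                              String tablet, String mutation, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ToyServer target = cache.get(tablet);
            if (target != null && cluster.tryWrite(target, tablet, mutation)) {
                return attempt;
            }
            cache.put(tablet, cluster.hostOf.get(tablet)); // refresh the stale location
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        ToyServer a = new ToyServer("A");
        ToyServer c = new ToyServer("C");
        ToyCluster cluster = new ToyCluster();
        cluster.hostOf.put("a-tablet", a);
        Map<String, ToyServer> cache = new HashMap<>(cluster.hostOf);

        // Server A dies and a-tablet is reassigned to C; the client's cache is now stale.
        a.alive = false;
        cluster.hostOf.put("a-tablet", c);

        int attempts = writeWithRetry(cluster, cache, "a-tablet", "row1", 5);
        System.out.println("applied on " + cache.get("a-tablet").name + " after " + attempts + " attempts");
    }
}
```

The point of the model is that the failed first attempt never reaches application code; the client simply refreshes its idea of where the tablet lives and sends again.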

I'm not a kernel expert, but I believe that a swappiness setting of "1" is equivalent to "0".

The error you are seeing is part of the failing tablet server scenario. This is a bit complicated, so I'm going to name your three tablet servers A, B and C.

Tablet server A is hosting a tablet; let's call it a-tablet.

Tablet server B is hosting a metadata tablet; let's call it m-tablet.

m-tablet records the information about a-tablet:

- where it is hosted
- what files it has, and their approximate sizes
- book-keeping related to bulk ingest
- etc.

I think the O'Reilly Accumulo book has some great details.

Now when A ingests some data, it eventually flushes the updates from memory to a file. Tablet server A then writes this new information to m-tablet, on tablet server B.

Now for the failure:

Tablet server A does a Java garbage collection and starts pulling data from swap. That makes it go really slowly, and it loses its ZooKeeper session. But it's running so slowly that it takes a moment to realize it should die.

In the meantime, the thread that is flushing memory attempts to update m-tablet with the new file information. Fortunately, there's a constraint on m-tablet: mutations must contain a valid ZooKeeper session. This prevents tablet server A from making updates to m-tablet when it no longer has the right to host the tablet.

Your initial error is from tablet server A making an update to tablet server B's m-tablet. It's getting a constraint violation: tablet server A has lost its ZooKeeper session, and will fail momentarily.

To make this extra confusing: A and B might be the same server.

-Eric

On Tue, Dec 22, 2015 at 11:31 PM, mohit.kaushik <mohit.kaus...@orkash.com> wrote:

>
> I have 3 tablet servers having around 1.4K tablets. If a tablet server
> loses its session with zookeeper and kills itself, the system takes some
> time to move all hosted tablets to other servers.
>
> In this case, if an ingest is in progress, what should happen to the
> mutations going to tablets hosted by that tablet server?
> Is that the reason for the first exception? Should they not be redirected to
> other servers?
> And I had set the system swappiness to 1. Should I keep it at 0 in this case?
> I will check further.
>
> Thanks for the reply
>
> -Mohit Kaushik
>
>
> On 12/22/2015 08:17 PM, Eric Newton wrote:
>
> A tablet server is given the rights to manage a tablet.
>
> To maintain consistency, it is critical that no other server uses the
> tablet.
>
> To maintain the right to access a tablet, it must maintain a ZooKeeper
> session. The ZooKeeper session periodically exchanges keep-alive messages.
> If either party fails to get a keep-alive, ZooKeeper will close the
> connection. The client can attempt to reconnect, but if it fails to do so,
> the session will time out.
>
> If the tablet server loses its session with ZooKeeper, the rest of the
> system can take over its tablets.
>
> When a tablet server detects that it has lost its ZooKeeper session, it
> kills itself to avoid doing anything with the tablets it no longer has the
> right to host.
>
> What you are seeing here is the first step in that process, and it is
> probably due to the tablet server not sending a keep-alive message to
> ZooKeeper in time.
>
> There are many reasons for a tablet server to be delayed in sending a
> keep-alive message. By far the most common is that your system is
> over-subscribed for memory, and part of the tablet server's memory was
> swapped out. Once the Java garbage collection cycle swapped it back in,
> there was a considerable delay.
>
> However, there can be other things going on. This is just a best guess.
> Monitor swap usage as a first diagnostic step.
>
> -Eric
>
>
> On Tue, Dec 22, 2015 at 8:30 AM, mohit.kaushik <mohit.kaus...@orkash.com>
> wrote:
>
>> Dear All,
>>
>> A MutationsRejectedException can be seen on the client side, with 1
>> server error.
>>
>> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {} # server errors 1 # exceptions 1
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>>     at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
>>     at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
>>     at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
>>     at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash1:9997
>>     at
>>
>> I also found exceptions in the Monitor related to tracing.
>>
>> Tracing spans are being dropped because there are already 5000 spans queued
>> for delivery.
>> This does not affect performance, security or data integrity, but
>> distributed tracing information is being lost.
>>
>> and, 6458 times:
>>
>> Got an IOException in internalRead!
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>     at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:537)
>>     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
>>     at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
>>     at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.select(CustomNonBlockingServer.java:228)
>>     at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.run
>>
>>
>> I am facing the following exceptions in the tserver logs, and one tserver
>> goes dead.
>>
>> 2015-12-22 09:37:27,173 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f8708e0d-9238-41f5-b948-8f435fd01207/tables/16/conf/table.split.threshold
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
>>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
>>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
>>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
>>     at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
>>     at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
>>     at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
>>     at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:197)
>>     at org.apache.accumulo.tserver.tablet.Tablet.findSplitRow(Tablet.java:1604)
>>     at org.apache.accumulo.tserver.tablet.Tablet.needsSplit(Tablet.java:1772)
>>     at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:1853)
>>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> These are causing problems with continuous data ingest, and I have also
>> experienced some delay in queries and table create commands.
>> Could you comment on what might be causing these exceptions?
>>
>> Thanks
>> Mohit Kaushik
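
On the suggestion earlier in this thread to monitor swap usage as a first diagnostic step: one rough way to do that from Java is to read `/proc/meminfo` and subtract `SwapFree` from `SwapTotal`. This is a minimal sketch that assumes a Linux host; the class and method names (`SwapCheck`, `valueKb`, `swapUsedKb`) are made up for illustration and are not part of Accumulo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Reports how much swap is in use by parsing Linux /proc/meminfo. */
final class SwapCheck {

    /** Returns the value in kB for a key such as "SwapTotal", or -1 if the key is absent. */
    static long valueKb(List<String> meminfoLines, String key) {
        for (String line : meminfoLines) {
            if (line.startsWith(key + ":")) {
                // Lines look like "SwapTotal:       2097148 kB"
                String[] parts = line.substring(key.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    /** Swap in use, in kB: SwapTotal minus SwapFree, or -1 if either field is missing. */
    static long swapUsedKb(List<String> meminfoLines) {
        long total = valueKb(meminfoLines, "SwapTotal");
        long free = valueKb(meminfoLines, "SwapFree");
        return (total < 0 || free < 0) ? -1 : total - free;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: running on Linux, where /proc/meminfo exists.
        List<String> lines = Files.readAllLines(Paths.get("/proc/meminfo"));
        System.out.println("swap used: " + swapUsedKb(lines) + " kB");
    }
}
```

A value that keeps growing on a tablet server host while ingest is running would be consistent with the memory over-subscription explanation above, though it does not prove it.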