The Accumulo batch writer will re-send mutations if a tablet server fails,
or if a server rejects the mutations because the tablet has moved.  There's
nothing you have to do to recover from fail-overs and re-balancing.
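
If it helps to picture the re-send, here is a toy Java sketch of the idea. It is not the batch writer's code: the class ResendSketch, its methods, and the bounded attempt count are all made up for illustration, and it only models "send to the cached location, and if that server no longer hosts the tablet, look it up again and re-send".

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: names and retry policy are hypothetical, not Accumulo's.
class ResendSketch {

    // Toy cluster state: which server currently hosts each tablet.
    static final Map<String, String> hosting = new HashMap<>();

    // Toy server-side write: succeeds only if 'server' really hosts 'tablet'.
    static boolean write(String server, String tablet) {
        return server.equals(hosting.get(tablet));
    }

    // Toy client: sends to the cached location; on rejection it refreshes the
    // cached location and tries again. Returns the number of attempts used,
    // or -1 if it gave up after maxAttempts.
    static int send(Map<String, String> cache, String tablet, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String server = cache.get(tablet);
            if (server != null && write(server, tablet)) {
                return attempt;
            }
            String current = hosting.get(tablet);
            if (current != null) {
                cache.put(tablet, current); // stale location replaced
            } else {
                cache.remove(tablet);       // tablet is not hosted anywhere yet
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        hosting.put("a-tablet", "A");
        cache.put("a-tablet", "A");
        System.out.println("attempts while A hosts it: " + send(cache, "a-tablet", 3));

        hosting.put("a-tablet", "C"); // A died, the tablet moved to C
        System.out.println("attempts after the move:   " + send(cache, "a-tablet", 3));
    }
}
```

The point is only that the retry happens inside the client library; your ingest code does not have to notice that the tablet moved.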

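I'm not a kernel expert, but I believe that a swappiness setting of "1" is
close to "0" in practice: both tell the kernel to avoid swapping as much as
it can.  On newer kernels (3.5 and later, as far as I know) "0" is stricter,
so the two are not exactly equivalent.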
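
If you want to see what the kernel is actually set to, and whether the box is touching swap at all, a rough Java sketch along these lines would do. It assumes Linux with /proc mounted; the class and method names (SwapCheck, fieldKb, swapUsedKb) are mine, made up for illustration, and are not part of Accumulo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, assuming the usual Linux /proc/meminfo layout.
class SwapCheck {

    // Returns the value in kB of a /proc/meminfo field such as "SwapTotal",
    // or -1 if the field is not present.
    static long fieldKb(String meminfo, String field) {
        for (String line : meminfo.split("\n")) {
            if (line.startsWith(field + ":")) {
                // Lines look like "SwapTotal:       2097148 kB"
                String[] parts = line.substring(field.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    // Swap in use, in kB: SwapTotal minus SwapFree.
    static long swapUsedKb(String meminfo) {
        return fieldKb(meminfo, "SwapTotal") - fieldKb(meminfo, "SwapFree");
    }

    public static void main(String[] args) throws IOException {
        Path meminfo = Paths.get("/proc/meminfo");
        Path swappiness = Paths.get("/proc/sys/vm/swappiness");
        if (!Files.isReadable(meminfo) || !Files.isReadable(swappiness)) {
            System.out.println("No readable /proc here; this sketch assumes Linux.");
            return;
        }
        String info = new String(Files.readAllBytes(meminfo), StandardCharsets.UTF_8);
        String setting = new String(Files.readAllBytes(swappiness), StandardCharsets.UTF_8).trim();
        System.out.println("vm.swappiness  = " + setting);
        System.out.println("swap used (kB) = " + swapUsedKb(info));
    }
}
```

Run it on each tablet server a few times while ingest is going.  If the "swap used" number is non-zero and moving, memory is the first thing I would look at.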

The error you are seeing is part of the failing tablet server scenario.
This is a bit complicated, so I'm going to name your three tablet servers
A, B and C.

Tablet server A is hosting a tablet, let's call it a-tablet.
Tablet server B is hosting a metadata tablet, let's call it m-tablet.
m-tablet records the information about a-tablet:

   - where it is hosted
   - what files it has, and their approximate sizes
   - book-keeping related to bulk ingest
   - etc. I think the O'Reilly Accumulo book has some great details

Now when A ingests some data, it eventually flushes the updates from memory
to a file.
Tablet server A then writes this new information to m-tablet, on Tablet
server B.

Now for the failure:
Tablet server A does a Java garbage collection, and starts pulling
data from swap. That makes it go really slowly, and it loses its zookeeper
session.

But it's running so slowly that it takes a moment to realize it should
die.

In the meantime, the thread that is flushing memory attempts to update
m-tablet with the new file information.

Fortunately, there's a constraint on m-tablet. The constraint is that
mutations must contain a valid zookeeper session.  This prevents tablet
server A from making updates to m-tablet when it no longer has the right to
host the tablet.

Your initial error is from tablet server A making an update to tablet
server B's m-tablet.  It's getting a constraint violation: tablet server A
has lost its zookeeper session, and will fail momentarily.
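
The sequence above can be sketched as a toy model in Java. This is not Accumulo's constraint code: the classes LiveSessions and MetadataTablet, the session id "session-A", and the file names are all made up for illustration. It only shows the rule "an update to m-tablet is accepted only while the writer's session is still live".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: every name here is hypothetical.
class SessionConstraintSketch {

    // Stand-in for zookeeper: the set of session ids that are still alive.
    static class LiveSessions {
        private final Set<String> live = new HashSet<>();
        void open(String sessionId) { live.add(sessionId); }
        void expire(String sessionId) { live.remove(sessionId); }
        boolean isLive(String sessionId) { return live.contains(sessionId); }
    }

    // Stand-in for m-tablet: records new files for a-tablet, but only from
    // a writer whose session is still live.
    static class MetadataTablet {
        private final LiveSessions sessions;
        final List<String> files = new ArrayList<>();

        MetadataTablet(LiveSessions sessions) { this.sessions = sessions; }

        // Returns true if the update was applied, false if the constraint
        // rejected it.
        boolean update(String writerSession, String newFile) {
            if (!sessions.isLive(writerSession)) {
                return false; // writer no longer has the right to host the tablet
            }
            files.add(newFile);
            return true;
        }
    }

    public static void main(String[] args) {
        LiveSessions zk = new LiveSessions();
        MetadataTablet mTablet = new MetadataTablet(zk);

        zk.open("session-A");
        System.out.println("flush 1 accepted: " + mTablet.update("session-A", "F0001.rf"));

        zk.expire("session-A"); // A stalls in GC/swap and misses its keep-alives
        System.out.println("flush 2 accepted: " + mTablet.update("session-A", "F0002.rf"));
    }
}
```

The second flush is the one that corresponds to your error: A is still running and still trying to write, but its session is already gone, so the update is refused.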

To make this extra confusing: A and B might be the same server.

-Eric


On Tue, Dec 22, 2015 at 11:31 PM, mohit.kaushik <mohit.kaus...@orkash.com>
wrote:

>
> I have 3 tablet servers hosting around 1.4K tablets. If a tablet server
> loses its session with zookeeper and kills itself, the system takes some
> time to move all hosted tablets to other servers.
>
> In this case, if an ingest is in progress, what should happen with the
> mutations going to tablets hosted by that tablet server?
> Is it the reason for the first exception? Should they not be redirected to
> other servers?
> And I had set the system swappiness to 1. Should I keep it at 0 in this case?
> I will check further.
>
> Thanks for the reply
>
> -Mohit Kaushik
>
>
> On 12/22/2015 08:17 PM, Eric Newton wrote:
>
> A tablet server is given the rights to manage a tablet.
>
> To maintain consistency, it is critical that no other server uses the
> tablet.
>
> To maintain the right to access a tablet, it must maintain a zookeeper
> session. The zookeeper session periodically exchanges keep-alive messages.
> If either party fails to get a keep-alive, zookeeper will close the
> connection. The client can attempt to reconnect, but if it fails to do so,
> the session will timeout.
>
> If the tablet server loses its session with zookeeper, the rest of the
> system can take over its tablets.
>
> When a tablet server detects that it lost its zookeeper session, it kills
> itself to avoid doing anything with the tablets it no longer has the right
> to host.
>
> What you are seeing here is the first step in that process, and it is
> probably due to the tablet server not sending a keep-alive message to
> zookeeper in time.
>
> There are many reasons for a tablet server to be delayed in sending a
> keep-alive message. By far the most common is that your system is
> over-subscribed for memory, and part of the tablet server's memory swapped
> out. Once the java garbage collection cycle swapped it back in, there was a
> considerable delay.
>
> However, there can be other things going on.  This is just a best guess.
> Monitor swap usage, as a first diagnostic step.
>
> -Eric
>
>
>
> On Tue, Dec 22, 2015 at 8:30 AM, mohit.kaushik <mohit.kaus...@orkash.com>
> wrote:
>
>> Dear All,
>>
>> The mutations rejected exception can be seen at client side with server
>> error 1.
>>
>> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {}  # server errors 1 # exceptions 1
>>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>>         at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>>         at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
>>         at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
>>         at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
>>         at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash1:9997
>>         at
>>
>> I also found exceptions in Monitor related to Tracing.
>>
>> Tracing spans are being dropped because there are already 5000 spans queued
>> for delivery.
>> This does not affect performance, security or data integrity, but
>> distributed tracing information is being lost.
>>
>> and, 6458 times:
>>
>> Got an IOException in internalRead!
>>      java.io.IOException: Connection reset by peer
>>              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>              at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>              at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>              at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:537)
>>              at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
>>              at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
>>              at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.select(CustomNonBlockingServer.java:228)
>>              at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.run
>>
>>
>>
>> I am facing the following exceptions in tserver logs and one tserver goes
>> dead.
>>
>> 2015-12-22 09:37:27,173 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f8708e0d-9238-41f5-b948-8f435fd01207/tables/16/conf/table.split.threshold
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>>         at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
>>         at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
>>         at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
>>         at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
>>         at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
>>         at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
>>         at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
>>         at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:197)
>>         at org.apache.accumulo.tserver.tablet.Tablet.findSplitRow(Tablet.java:1604)
>>         at org.apache.accumulo.tserver.tablet.Tablet.needsSplit(Tablet.java:1772)
>>         at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:1853)
>>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> These are creating problems in continuously ingesting data, and I also
>> experienced some delay in queries and table create commands.
>> Could you comment on what might be causing these exceptions?
>>
>> Thanks
>> Mohit Kaushik
>>
>>
>
>