... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception communicating with ZooKeeper, will retry SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
There can be a number of causes for this, but here are the most likely ones:

* JVM GC pauses
* ZooKeeper maxClientConnections
* Operating system/hardware-level pauses

The first should be noticeable in the Accumulo log: there is a daemon running which watches for pauses and reports them when they happen. If this is happening, you might have to give the process some more Java heap, tweak your CMS/G1 parameters, etc.

For maxClientConnections, see https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html

For the last, swappiness is the most likely candidate (assuming this is hopping across different physical nodes), as are "transparent huge pages". If it is limited to a single host, things like bad NICs, hard drives, and other hardware issues might be a source of slowness.

On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR <[email protected]> wrote:

> UNOFFICIAL
>
> It looks like an issue with one of the metadata table tablets. On startup
> the server that hosts a particular metadata tablet gets scanned by all other
> tablet servers in the cluster. This then crashes that tablet server with an
> error in the tserver log:
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> That metadata table tablet is then transferred to another host, which then
> fails also, and so on.
>
> While the server is hosting this metadata tablet, we see the following log
> statement in all tserver logs in the cluster:
>
> ... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> org.apache.thrift.transport.TTransportException null
> (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>
> Hope that helps complete the picture.
>
> ________________________________
> From: Christopher [mailto:[email protected]]
> Sent: Tuesday, 21 February 2017 13:17
> To: [email protected]
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> Removing them is probably a bad idea. The root table entries correspond to
> split points in the metadata table. The tables which existed when the
> metadata table split do not need to still exist for those entries to
> continue to act as valid split points.
>
> Would need to see the exception stack trace, or at least an error message,
> to troubleshoot the shell scanning error you saw.
>
> On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <[email protected]> wrote:
>>
>> UNOFFICIAL
>>
>> In case it is ok to remove these from the root table, how can I scan the
>> root table for rows with a rowid starting with !0;1vm?
>>
>> Running "scan -b !0;1vm" throws an exception and exits the shell.
>>
>> -----Original Message-----
>> From: Dickson, Matt MR [mailto:[email protected]]
>> Sent: Tuesday, 21 February 2017 09:30
>> To: '[email protected]'
>> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> UNOFFICIAL
>>
>> Does that mean I should have entries for 1vm in the metadata table
>> corresponding to the root table?
>>
>> We are running 1.6.5
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:[email protected]]
>> Sent: Tuesday, 21 February 2017 09:22
>> To: [email protected]
>> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> The root table should only reference the tablets in the metadata table.
>> It's a hierarchy: like metadata is for the user tables, root is for the
>> metadata table.
>>
>> What version are ya running, Matt?
>>
>> Dickson, Matt MR wrote:
>> > *UNOFFICIAL*
>> >
>> > I have a situation where all tablet servers are progressively being
>> > declared dead. From the logs the tservers report errors like:
>> >
>> > 2017-02-.... DEBUG: Scan failed, thrift error
>> > org.apache.thrift.transport.TTransportException null
>> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>> >
>> > 1vm was a table id that was deleted several months ago, so it appears
>> > there is some invalid reference somewhere.
>> >
>> > Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
>> >
>> > A scan of the accumulo.root table returns approximately 15 rows that
>> > start with !0;1vm;<ip addr>::2016103 blah ...
>> >
>> > How are the root table entries used, and would it be safe to remove
>> > these entries since they reference a deleted table?
>> >
>> > Thanks in advance,
>> > Matt
>
> --
> Christopher
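The pause suspects Josh lists can be checked from a shell. A minimal sketch, assuming standard Linux procfs/sysfs paths; the ZooKeeper `cons` output below is a fabricated sample standing in for what `echo cons | nc <zookeeper-host> 2181` would return on a live server:

```shell
# Inspect the OS-level pause suspects; guard with -r so the script
# degrades gracefully on non-Linux systems.
if [ -r /proc/sys/vm/swappiness ]; then
  # High swappiness lets the kernel page out idle JVM heap, causing long pauses
  echo "vm.swappiness = $(cat /proc/sys/vm/swappiness)"
fi
if [ -r /sys/kernel/mm/transparent_hugepage/enabled ]; then
  # "[always]" is the setting commonly recommended against for Hadoop-style workloads
  echo "THP: $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
fi

# For maxClientConnections: count open connections per client IP using
# ZooKeeper's "cons" four-letter-word command. This sample is fabricated
# for illustration; live output would come from:
#   echo cons | nc <zookeeper-host> 2181
cons_sample='/10.0.0.5:51234[1](queued=0,recved=100,sent=100)
/10.0.0.5:51240[1](queued=0,recved=50,sent=50)
/10.0.0.7:49152[1](queued=0,recved=10,sent=10)'

# Split on '/' and ':' so field 2 is the client IP, then tally per IP
printf '%s\n' "$cons_sample" \
  | awk -F'[/:]' '{print $2}' \
  | sort | uniq -c | sort -rn
```

If any IP's count is near the server's maxClientConnections limit (60 by default), tablet servers on that host can be refused new sessions, which would show up as exactly the kind of ZooKeeper session expirations seen above.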
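Christopher's point about split points can be made concrete. A rough sketch, using a row adapted from the ones Matt reported; the `!0;<endRow>` layout described in the comments is my reading of the root-table row format, not something confirmed in this thread:

```shell
# Each root-table row is "!0;<endRow>": "!0" is the metadata table's id,
# and <endRow> is a split point of the metadata table. The split point
# itself begins with a user table id (here the deleted 1vm), which is why
# ids of long-deleted tables can legitimately linger in the root table.
row='!0;1vm;125.323.233.23::2016103'

metadata_table_id=${row%%;*}   # everything before the first ';'
end_row=${row#*;}              # everything after the first ';'
user_table_id=${end_row%%;*}   # leading component of the split point

echo "metadata table id: $metadata_table_id"
echo "split point:       $end_row"
echo "user table id:     $user_table_id"
```

Under this reading, removing the rows would discard valid metadata-table split points, which matches Christopher's warning that deleting them is probably a bad idea.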
