It should be safe to merge on the metadata table. That was one of the goals of moving the root tablet into its own table. I'm pretty sure we have a build test to ensure it works.
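If you do go that route, the shell side of it is just a plain merge on accumulo.metadata. This is a sketch only, not something I've run against your cluster; with no -b/-e range it collapses the whole metadata table into one tablet, which should then re-split on its own once it passes the split threshold:

  merge -t accumulo.metadata

If you'd rather not collapse everything first, -b/-e restricts the merge to a range (for example, just the stale splits left behind by 1vm), and -s merges only up to a target tablet size.
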
On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR <[email protected]> wrote:

> *UNOFFICIAL*
>
> Firstly, thank you for your advice; it's been very helpful.
>
> Increasing the tablet server memory has allowed the metadata table to come online. From using rfile-info and looking at the splits for the metadata table, it appears that all the metadata table entries are in one tablet. All tablet servers then query the one node hosting that tablet.
>
> I suspect the cause of this was a poorly designed table that at one point the Accumulo GUI reported 1.02T tablets for. We've subsequently deleted that table, but it might be that there were so many entries in the metadata table that all of its splits came from this massive table, which had the table id 1vm.
>
> To rectify this, is it safe to run a merge on the metadata table to force it to redistribute?
>
> ------------------------------
> *From:* Michael Wall [mailto:[email protected]]
> *Sent:* Wednesday, 22 February 2017 02:44
> *To:* [email protected]
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> Matt,
>
> If I am reading this correctly, you have a tablet that is being loaded onto a tserver. That tserver dies, so the tablet is then assigned to another tserver. While the tablet is being loaded, that tserver dies, and so on. Is that correct?
>
> Can you identify the tablet that is bouncing around? If so, try using rfile-info -d to inspect the rfiles associated with that tablet, and look at the rfiles that compose that tablet to see if anything sticks out.
>
> Any logs that would help explain why the tablet server is dying? Can you increase the memory of the tserver?
>
> Mike
>
> On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <[email protected]> wrote:
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> There can be a number of causes for this, but here are the most likely ones:
>
> * JVM GC pauses
> * ZooKeeper max client connections
> * Operating system/hardware-level pauses
>
> The first should be noticeable in the Accumulo log. There is a daemon running which watches for pauses and reports them when they happen. If this is happening, you might have to give the process some more Java heap, tweak your CMS/G1 parameters, etc.
>
> For maxClientConnections, see
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
> For the last, swappiness is the most likely candidate (assuming this is hopping across different physical nodes), as are "transparent huge pages". If it is limited to a single host, things like bad NICs, hard drives, and other hardware issues might be a source of slowness.
>
> On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR <[email protected]> wrote:
> > UNOFFICIAL
> >
> > It looks like an issue with one of the metadata table tablets. On startup, the server that hosts a particular metadata tablet gets scanned by all other tablet servers in the cluster. This then crashes that tablet server with an error in the tserver log:
> >
> > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception communicating with ZooKeeper, will retry
> > SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> > That metadata table tablet is then transferred to another host, which then fails also, and so on.
> >
> > While the server is hosting this metadata tablet, we see the following log statement in all tserver.logs in the cluster:
> >
> > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error org.apache.thrift.transport.TTransportException null (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> >
> > Hope that helps complete the picture.
> >
> > ________________________________
> > From: Christopher [mailto:[email protected]]
> > Sent: Tuesday, 21 February 2017 13:17
> > To: [email protected]
> > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > Removing them is probably a bad idea. The root table entries correspond to split points in the metadata table. The tables that existed when the metadata table split do not need to still exist for those rows to remain valid split points.
> >
> > Would need to see the exception stack trace, or at least an error message, to troubleshoot the shell scanning error you saw.
> >
> > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <[email protected]> wrote:
> >>
> >> UNOFFICIAL
> >>
> >> In case it is ok to remove these from the root table, how can I scan the root table for rows with a rowid starting with !0;1vm?
> >>
> >> Running "scan -b !0;1vm" throws an exception and exits the shell.
> >>
> >> -----Original Message-----
> >> From: Dickson, Matt MR [mailto:[email protected]]
> >> Sent: Tuesday, 21 February 2017 09:30
> >> To: '[email protected]'
> >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> UNOFFICIAL
> >>
> >> Does that mean I should have entries for 1vm in the metadata table corresponding to the root table?
> >>
> >> We are running 1.6.5
> >>
> >> -----Original Message-----
> >> From: Josh Elser [mailto:[email protected]]
> >> Sent: Tuesday, 21 February 2017 09:22
> >> To: [email protected]
> >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> The root table should only reference the tablets in the metadata table. It's a hierarchy: like metadata is for the user tables, root is for the metadata table.
> >>
> >> What version are ya running, Matt?
> >>
> >> Dickson, Matt MR wrote:
> >> > *UNOFFICIAL*
> >> >
> >> > I have a situation where all tablet servers are progressively being declared dead. From the logs, the tservers report errors like:
> >> >
> >> > 2017-02-.... DEBUG: Scan failed thrift error org.apache.thrift.transport.TTransportException null (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> >> >
> >> > 1vm was a table id that was deleted several months ago, so it appears there is some invalid reference somewhere.
> >> >
> >> > Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
> >> >
> >> > A scan of the accumulo.root table returns approximately 15 rows that start with !0:1vm;<i/p addr>/::2016103 /blah/
> >> >
> >> > How are the root table entries used, and would it be safe to remove these entries since they reference a deleted table?
> >> >
> >> > Thanks in advance,
> >> > Matt
> >
> > --
> > Christopher

--
Christopher
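
P.S. For the rfile inspection Mike suggested, the invocation is the rfile-info tool with its dump flag. The path below is only a placeholder; substitute whatever file entries the bouncing metadata tablet actually lists:

  accumulo rfile-info -d hdfs:///accumulo/tables/!0/table_info/A000example.rf

The -d output dumps the keys/values in the file, which should make it easier to spot rows that still start with the old 1vm table id.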
