Just for completeness, the solution in the end was to stop then start the tservers one at a time until the error cleared. I never found a way to work out which tserver was causing the issue.
From: Hart, Andrew [mailto:[email protected]]
Sent: 07 October 2020 13:54
To: [email protected]
Subject: RE: Continuous tablets unloaded and fails to balance from accumulo master

EXTERNAL SENDER: Do not click any links or open any attachments unless you trust the sender and know the content is safe.
EXPÉDITEUR EXTERNE: Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe à moins qu’ils ne proviennent d’un expéditeur fiable, ou que vous ayez l'assurance que le contenu provient d'une source sûre.

Thanks for your suggestions.

Restarting the tserver that had the tablets assigned to a dead server: I tried this, but nothing happened to the tablets. They are not part of any table, so the restart did not appear to do anything.

Scanning for missing loc entries: the command you suggested produced no output other than a "zootraceclient was loaded" statement.

Restarting the master: this works for one balance only, and then it returns to "1 tablets are unloaded". This has been my workaround for the last few weeks.

I assume the tables are old and deleted, since their IDs in the metadata are lower than those of currently created tables and the IDs don't appear in tables -l.

I like your GC idea; I will look into that. I may have cloned tables in the past to fix some other problem, but it is not something I would normally do.

Thanks again for your ideas.

From: Mike Miller <[email protected]>
Sent: 06 October 2020 19:53
To: [email protected]
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

It would help if you provided the commands you are running and some of the output (if possible), or at least more detail about what you are seeing. It's hard to provide specifics, because it's hard to understand how you got into this state, what you have done, and what the current state is.

If tablets are assigned to a dead server, but you think that server is OK, did you try taking that server down? Once the server is detected as down, that should trigger reassignments; at that point you can restart the server.

Scanning the accumulo.metadata table: does every extent have a loc entry? Something like:

    accumulo shell -u root -p secret -e 'scan -t accumulo.metadata -np -c loc' | grep -v loc

Have you tried restarting the master?

If the tables are "old" and deleted, what are you onlining? Have you tried to delete an offline table?

Is your GC running to completion? Do you clone tables? One issue may be that the Accumulo GC needs to check that a file is not shared between tables; maybe it's running into issues completing that check?

On Tue, Oct 6, 2020 at 12:57 PM Christopher <[email protected]> wrote:

I'm not sure CheckForMetadataProblems can check for all that many different types of problems. It is limited. If you have tablets still in the metadata table for tables that no longer exist, that indicates you probably had some sort of crash and possible corruption of your metadata. The only option would be to manually delete those entries. A command to automatically prune these would probably be dangerous: running it when there's a transient ZooKeeper problem, for example, could end up deleting all your tables, which would be bad. Although it is dangerous, manual surgery on the metadata table to remove these entries, as you suggested, is probably the best option.

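Before any manual surgery on the metadata table, it may help to see which table IDs the leftover tablets actually belong to. The following is only a minimal sketch, not an Accumulo tool. The function name orphaned_table_ids is hypothetical. It assumes that the row IDs from a scan of accumulo.metadata and the output of tables -l have been saved as plain text, that tablet rows look like "tableId;endRow" or "tableId<", and that tables -l prints lines of the form "name => id". It only counts rows and deletes nothing.

```python
def orphaned_table_ids(metadata_rows, tables_listing):
    """Count metadata tablet rows whose table ID is not in the `tables -l` listing.

    metadata_rows:  iterable of row IDs (assumed form "<id>;<endRow>" or "<id><").
    tables_listing: iterable of lines (assumed form "<name> => <id>").
    Returns a dict mapping each unknown table ID to its number of tablet rows.
    """
    # Collect the IDs of tables that Accumulo still knows about.
    known_ids = set()
    for line in tables_listing:
        if "=>" in line:
            known_ids.add(line.split("=>", 1)[1].strip())

    counts = {}
    for row in metadata_rows:
        row = row.strip()
        # Skip blanks and non-tablet rows such as "~del" delete markers.
        if not row or row.startswith("~"):
            continue
        if ";" in row:
            table_id = row.split(";", 1)[0]   # "<id>;<endRow>"
        elif row.endswith("<"):
            table_id = row[:-1]               # "<id><" is the last tablet
        else:
            continue                          # not a recognised tablet row
        if table_id not in known_ids:
            counts[table_id] = counts.get(table_id, 0) + 1
    return counts


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    rows = ["2;a", "2;m", "2<", "5;x", "5<"]
    listing = ["accumulo.metadata    =>        !0", "mytable    =>         5"]
    print(orphaned_table_ids(rows, listing))
```

Running this during a transient ZooKeeper problem could make live tables look orphaned, which is the risk described above, so the result is best treated as a list of candidates to check by hand.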
On Tue, Oct 6, 2020 at 12:03 PM Hart, Andrew <[email protected]> wrote:

I am still trying to find the one “unloaded tablet” that is preventing the cluster from balancing. However, there are a lot of unassigned tablets. I have been getting rid of them by onlining tables and completing failed table deletes, but I am still left with many tablets that are unassigned. They seem to be mostly from old deleted tables, so I am not sure why they are there at all.

The unassigned tablets are shown in

    accumulo org.apache.accumulo.server.util.FindOfflineTablets

and in

    accumulo admin checkTablets

And as I said, some are assigned to a dead server, but the server isn’t actually dead at all.

CheckForMetadataProblems reports “All is well”.

I thought that if I could clear up this mess I could eventually get down to just one unassigned tablet, which would be the “1 tablets are unloaded” one. (I would then clone the table or copy the data out or something.)

So the problem remains. The cluster doesn’t balance due to migrations. I can’t find a tablet with a future entry, and I can’t find it among the unassigned or offline tablets because of the large number of other (presumably defunct) unassigned tablets in tables that no longer exist.

There are warnings in the documentation about manually editing the accumulo metadata table, but it seems that the only option is to go in with a deletemany on any rows that start with the ID of an old deleted table.

There does not seem to be an “accumulo admin pruneDefunctTablets -t tid” command! :D

From: Mike Miller <[email protected]>
Sent: 06 October 2020 16:27
To: [email protected]
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

Do you want to merge old tablets that don't exist anymore? I am not sure what you are asking. You might have better luck if you provide some more info and ask on Slack: https://accumulo.apache.org/contact-us/#slack

On Tue, Oct 6, 2020 at 7:25 AM Hart, Andrew <[email protected]> wrote:

What is the way to remove tablets that still exist in accumulo but do not have an online, offline or deleting table?

Some of these tablets say ASSIGNED TO DEAD SERVER but the tserver they refer to is up and working properly.

From: Hart, Andrew <[email protected]>
Sent: 25 September 2020 13:52
To: [email protected]
Subject: RE: Continuous tablets unloaded and fails to balance from accumulo master

Thanks for your help.

In looking for this I think I have found that there are deleted tables that still have a lot of tablets in the metadata table. I need to solve that before coming back to find the 1 unloaded tablet.

Cheers
And.

From: Mike Miller <[email protected]>
Sent: 24 September 2020 16:08
To: [email protected]
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

That might be OK, could just mean it hasn't been assigned yet. The only way I can think of is to populate a list of all tablets from the metadata table and find the one without a "loc" column family.

On Thu, Sep 24, 2020 at 10:55 AM Hart, Andrew <[email protected]> wrote:

No, no future entries in the table.

From: Mike Miller <[email protected]>
Sent: 24 September 2020 15:10
To: [email protected]
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

You should be able to figure out the unloaded tablet from the "accumulo.metadata" table. The metadata table will list the tablet location using the "loc" column family to indicate it has loaded a tablet that it was assigned. For example the tablet "n;9" will have an entry like:

    n;9 loc:1000041fbf00006 [] ip-172-31-87-51.ec2.internal:9997

From my understanding, the unloaded tablet should have a "future" column family, meaning it has been assigned a new location but not loaded yet.
If the tablet doesn't have a "loc" or "future" column family then that is a problem.

On Thu, Sep 24, 2020 at 6:32 AM Hart, Andrew <[email protected]> wrote:

Hi,

I am getting “Not balancing due to 1 outstanding migrations” and “[Normal tablets]: 1 tablets unloaded”. This means that the cluster never balances unless I restart the master, after which I get a one-off balance and then it returns to the above messages.

How do I identify the tablet that is unloaded? It isn’t in the logs that I can see.

Is it possible to tell from the contents of the accumulo.metadata table? Is there a way to use FindOfflineTablets?

And.
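
The check discussed in this thread (list every tablet in the metadata table and find the ones with neither a "loc" nor a "future" column family) can be sketched roughly as follows. This is an illustration under assumptions, not part of Accumulo. The function name tablets_without_location is hypothetical. The input is assumed to be the saved text of a full accumulo.metadata scan (all columns, not only loc) with one "row family:qualifier [visibility] value" entry per line, as in the "n;9" example above. Rows are assumed to contain no whitespace, and shell banner or log lines are assumed to have been removed first.

```python
from collections import defaultdict


def tablets_without_location(scan_lines):
    """Return the sorted tablet rows that have neither a loc nor a future entry.

    scan_lines: iterable of text lines, each assumed to look like
                "<row> <family>:<qualifier> [<visibility>] <value>".
    """
    families_by_row = defaultdict(set)
    for line in scan_lines:
        parts = line.split()
        # Skip blank lines and anything that lacks a "family:qualifier" token.
        if len(parts) < 2 or ":" not in parts[1]:
            continue
        row, column = parts[0], parts[1]
        # Rows starting with "~" (for example "~del" markers) are not tablets.
        if row.startswith("~"):
            continue
        families_by_row[row].add(column.split(":", 1)[0])

    # A tablet is hosted (loc) or assigned but not yet loaded (future);
    # a row with neither family is reported.
    return sorted(
        row
        for row, families in families_by_row.items()
        if not families & {"loc", "future"}
    )


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = [
        "n;9 loc:1000041fbf00006 [] ip-172-31-87-51.ec2.internal:9997",
        "n< future:1000041fbf00006 [] ip-172-31-87-51.ec2.internal:9997",
        "3;m srv:dir [] /t-0002",
    ]
    for row in tablets_without_location(sample):
        print(row)
```

Grouping by row matters here: filtering a scan that is already restricted to the loc column can only show entries that exist, so it cannot reveal a tablet whose loc entry is missing.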
