Mike,

Yes, thanks for the help. We had to delete the recovered files generated from the WAL a few times but that worked. Then we restarted the two tablets with the TProtocolException exceptions to fix those errors. We saved off the log files for you.

Jeff

--
Jeff Kubina
410-988-4436

On Fri, Oct 21, 2016 at 10:57 AM, Michael Wall <[email protected]> wrote:

> Andrew/Jeff,
>
> How's it going? Did you resolve your issue?
>
> Mike
>
> On Tue, Oct 18, 2016 at 10:42 AM, Andrew Hulbert <[email protected]> wrote:
>
>> I think it is attempting to do migrations at the moment FYI
>>
>> On 10/18/2016 10:40 AM, Andrew Hulbert wrote:
>>
>> Yes, it looks similar.
>>
>> Esp these parts:
>>
>> 2015-11-19 22:43:05,998 [impl.TabletServerBatchReaderIterator] DEBUG: org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 19
>> java.io.IOException: org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 19
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:702)
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:349)
>>     at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 19
>>     at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:472)
>>     at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:317)
>>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:297)
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:634)
>>     ... 6 more
>>
>> On 10/18/2016 10:34 AM, Josh Elser wrote:
>>
>> Or, if it's more convenient, this is the issue I was thinking of:
>> https://issues.apache.org/jira/browse/ACCUMULO-4065
>>
>> Andrew Hulbert wrote:
>>
>> I'll try to dig up the full error from the tserver
>>
>> On 10/18/2016 10:30 AM, Josh Elser wrote:
>>
>> Do you have the full exception for the "Expected protocol id.." error?
>>
>> That looks like it might be incorrect usage of Thrift on our part..
>>
>> Andrew Hulbert wrote:
>>
>> Mike,
>>
>> So backing up and then later deleting the recovery directories a few times did the trick. It seemed that removing the initial bad one caused the others to go through for the most part...
>>
>> I believe all the WAL files were there. I'll look for the WAL deleted in the GC logs and see if there's any evidence of that. It is version 1.6.4 by the way. Unfortunately can't send the logs to you here but I did save them off and I'll talk to Jeff about what we can do.
>>
>> We are currently getting a new error that I'm going to look into...
>>
>> Expected protocol id ffffffff82 but got 0
>>
>> Expected protocol id ffffffff82 but got 6e
>>
>> etc.
>>
>> Looking into that now! Thanks for the help so far, as usual!
>>
>> Andrew
>>
>> On 10/18/2016 09:46 AM, Michael Wall wrote:
>>
>> Andrew,
>>
>> That is what I was going to suggest you try. Where is that "Unable to find recovery files for extent" log? Any way we can see some actual logs?
>>
>> Are all the WALs there? Do you find any of the WALs deleted by GC in the gc logs? Do you find any duplicate WALs in the HDFS trash?
>>
>> On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert <[email protected]> wrote:
>>
>> Mike,
>>
>> For one of the WALs I backed up the recovery directory and that initiated a new recovery attempt as indicated in the tserver debug log...
>>
>> Then the exception was thrown:
>>
>> Unable to find recovery files for extent xxxxxx logentry xxxxx hdfs://path/to/wal/yyyy
>>
>> Any ideas? I figure we can zero out the WAL and it will go on with life but it would be nice to try and get the data!
>>
>> Thanks!
>>
>> On 10/18/2016 08:55 AM, Jeff Kubina wrote:
>>
>> On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall <[email protected]> wrote:
>>
>> Take a look at the master logs for where the WAL was sorted to the /accumulo/recovery/... directory. Then look to see if those WALs are still around and contain content.
>>
>> Checked one of them, yes it is around with content.
>>
>> Where is this EOF exception, on a tserver?
>>
>> Yes, the tserver.
>>
>> Is the master log complaining about anything?
>>
>> Repeating a message similar to the tserver but also that the tablet assignment failed for the tserver.
>>
>> tservers are not balancing because of all this.
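A note on reading the "Expected protocol id ffffff82 but got 19" message in the stack trace quoted in this thread: the trace shows it is raised in TCompactProtocol.readMessageBegin, and the "ffffff82" is most likely just the single byte 0x82 printed after Java widens it to a signed int. The sketch below only reproduces that formatting. The class name, the constant name, and the describe helper are hypothetical and are not Thrift or Accumulo code, and the value 0x82 is assumed to be the compact protocol's id byte. It says nothing about why the first byte read from the tserver was wrong.

```java
// Minimal sketch, not Thrift source: shows how a byte protocol id of 0x82
// turns into the string "ffffff82" when formatted with Integer.toHexString.
class ProtocolIdDemo {
    // Assumption: 0x82 is the id byte the compact protocol expects first.
    static final byte COMPACT_PROTOCOL_ID = (byte) 0x82;

    // Hypothetical helper that formats a message in the style seen in the trace.
    static String describe(byte expected, byte got) {
        // A byte is sign-extended to int before formatting, so the negative
        // byte 0x82 (-126) becomes 0xffffff82.
        return "Expected protocol id " + Integer.toHexString(expected)
                + " but got " + Integer.toHexString(got);
    }

    public static void main(String[] args) {
        // 0x19 is the first byte reported in the quoted stack trace.
        System.out.println(describe(COMPACT_PROTOCOL_ID, (byte) 0x19));
    }
}
```

Read this way, the "got" value (19, 0, 6e) is simply whatever byte happened to be at the start of the response, which fits a reply that is not a well-formed message start rather than a specific error code.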
