Mike,

Yes, thanks for the help. We had to delete the recovered files generated
from the WAL a few times but that worked. Then we restarted the two tablets
with the TProtocolException exceptions to fix those errors. We saved off
the log files for you.

Jeff


-- 
Jeff Kubina
410-988-4436


On Fri, Oct 21, 2016 at 10:57 AM, Michael Wall <[email protected]> wrote:

> Andrew/Jeff,
>
> How's it going? Did you resolve your issue?
>
> Mike
>
> On Tue, Oct 18, 2016 at 10:42 AM, Andrew Hulbert <[email protected]>
> wrote:
>
>> I think it is attempting to do migrations at the moment FYI
>>
>> On 10/18/2016 10:40 AM, Andrew Hulbert wrote:
>>
>> Yes, it looks similar.
>>
>> Esp these parts:
>>
>> 2015-11-19 22:43:05,998 [impl.TabletServerBatchReaderIterator] DEBUG: org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 19
>> java.io.IOException: org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 19
>>      at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:702)
>>      at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:349)
>>      at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>      at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>      at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 19
>>      at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:472)
>>      at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>>      at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:317)
>>      at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:297)
>>      at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:634)
>>      ... 6 more
>>
>>
>>
>>
>> On 10/18/2016 10:34 AM, Josh Elser wrote:
>>
>> Or, if it's more convenient, this is the issue I was thinking of:
>> https://issues.apache.org/jira/browse/ACCUMULO-4065
>>
>> Andrew Hulbert wrote:
>>
>> I'll try to dig up the full error from the tserver
>>
>>
>> On 10/18/2016 10:30 AM, Josh Elser wrote:
>>
>> Do you have the full exception for the "Expected protocol id.." error?
>>
>> That looks like it might be incorrect usage of Thrift on our part..
>>
>> Andrew Hulbert wrote:
>>
>> Mike,
>>
>> So backing up and then later deleting the recovery directories a few
>> times did the trick. It seemed that removing the initial bad one caused
>> the others to go through for the most part...
>>
>> I believe all the WAL files were there. I'll look for the WAL deleted in
>> the GC logs and see if there's any evidence of that. It is version 1.6.4
>> by the way. Unfortunately can't send the logs to you here but I did save
>> them off and I'll talk to Jeff about what we can do.
>>
>> We are currently getting a new error that I'm going to look into...
>>
>> Expected protocol id ffffff82 but got 0
>>
>> Expected protocol id ffffff82 but got 6e
>>
>> etc.
>>
>> Looking into that now! Thanks for the help so far, as usual!
>>
>> Andrew
>>
>> On 10/18/2016 09:46 AM, Michael Wall wrote:
>>
>> Andrew,
>>
>> That is what I was going to suggest you try. Where is that "Unable to
>> find recovery files for extent" log? Any way we can see some actual
>> logs?
>>
>> Are all the WALs there? Do you find any of the WALs deleted by GC in
>> the gc logs? Do you find any duplicate WALs in the HDFS trash?
>>
>> On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert <[email protected]> wrote:
>>
>> Mike,
>>
>> For one of the WALs I backed up the recovery directory and that
>> initiated a new recovery attempt as indicated in the tserver debug
>> log...
>>
>> Then the exception was thrown:
>>
>> Unable to find recovery files for extent xxxxxx logentry xxxxx
>> hdfs://path/to/wal/yyyy
>>
>> Any ideas? I figure we can zero out the WAL and it will go on with
>> life but it would be nice to try and get the data!
>>
>> Thanks!
>>
>>
>> On 10/18/2016 08:55 AM, Jeff Kubina wrote:
>>
>>
>> On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall <[email protected]> wrote:
>>
>> Take a look at the master logs for where the WAL was sorted
>> to the /accumulo/recovery/... directory. Then look to see if
>> those WALs are still around and contain content.
>>
>>
>> Checked one of them, yes it is around with content.
>>
>> Where is this EOF exception, on a tserver?
>>
>>
>> Yes, the tserver.
>>
>> Is the master log complaining about anything?
>>
>>
>> It is repeating a message similar to the tserver's, and also reporting
>> that tablet assignment failed for the tserver.
>>
>> tservers are not balancing because of all this.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
