Since the rfiles on disk are "later" than the ones referenced, I tend to
think old metadata got rewritten.  Since you can't get a timeline to better
understand what happened, the only thing I can think of is to reingest all
data since a known good point.  And then do things to make the future
better, like tweaking which logs you save and upgrading to 1.9.1.  Sorry, I
wish I had better answers for you.
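
For the logging piece, one option is to keep the DEBUG logs on local disk
with a size-capped rolling appender even if only INFO and above goes to
Splunk.  A rough log4j sketch (the appender name, path, and sizes are just
placeholders):

  log4j.appender.debuglocal=org.apache.log4j.RollingFileAppender
  log4j.appender.debuglocal.File=/var/log/accumulo/tserver_debug.log
  log4j.appender.debuglocal.MaxFileSize=512MB
  log4j.appender.debuglocal.MaxBackupIndex=10
  log4j.appender.debuglocal.Threshold=DEBUG
  log4j.appender.debuglocal.layout=org.apache.log4j.PatternLayout
  log4j.appender.debuglocal.layout.ConversionPattern=%d{ISO8601} [%c] %-5p: %m%n
  # keep Accumulo classes at DEBUG for the local appender only
  log4j.logger.org.apache.accumulo=DEBUG, debuglocal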


On Wed, May 16, 2018 at 11:25 AM Adam J. Shook <adamjsh...@gmail.com> wrote:

> I tried building a timeline but the logs are just not there.  We weren't
> sending the debug logs to Splunk due to the verbosity, but we may be
> tweaking the log4j settings a bit to make sure we get the log data stored
> in the event this happens again.  This very well could be attributed to the
> recovery failure; hard to say.  I'll be upgrading to 1.9.1 soon.
>
> On Mon, May 14, 2018 at 8:53 AM, Michael Wall <mjw...@gmail.com> wrote:
>
>> Can you pick some of the files that are missing and search through your
>> logs to put together a timeline?  See if you can find that file for a
>> specific tablet.  Then grab all the logs for when a file was created as a
>> result of a compaction, and when a file was included in a compaction for
>> that tablet.  Follow compactions for that tablet until you started getting
>> errors.  Then see what logs you have for WAL replay during that time for
>> that tablet and for the metadata table, and try to correlate.
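>>
>> For example, something like this (log locations and the rfile name are
>> placeholders) will pull everything about one file into a rough timeline:
>>
>>   # F0000abc.rf is a stand-in for one of the missing rfiles
>>   grep -h 'F0000abc.rf' /var/log/accumulo/tserver_*.log \
>>       /var/log/accumulo/master_*.log /var/log/accumulo/gc_*.log | sort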
>>
>> It's a shame you don't have the GC logs.  If you saw a file was GC'd and
>> then showed up in the metadata table again, that would help explain what
>> happened.  Like Christopher mentioned, this could be related to a recovery
>> failure.
>>
>> Mike
>>
>> On Sat, May 12, 2018 at 5:26 PM Adam J. Shook <adamjsh...@gmail.com>
>> wrote:
>>
>>> WALs are turned on.  Durability is set to flush for all tables except
>>> for root and metadata, which are sync.  The current rfile names on HDFS
>>> and in the metadata table are greater than the names of the files that
>>> are missing.  I searched through all of our current and historical logs
>>> in Splunk (which are only INFO level or higher).  Issues from the logs:
>>>
>>> * Problem reports saying the files are not found
>>> * IllegalStateException saying the rfile is closed when it tried to load
>>> the Bloom filter (likely the flappy DataNode)
>>> * IOException when reading the file saying Stream is closed (likely the
>>> flappy DataNode)
>>>
>>> Nothing in the GC logs -- all the above errors are in the tablet server
>>> logs.  The logs may have rolled over, though, and our debug logs don't make
>>> it into Splunk.
>>>
>>> --Adam
>>>
>>> On Fri, May 11, 2018 at 6:16 PM, Christopher <ctubb...@apache.org>
>>> wrote:
>>>
>>>> Oh, it occurs to me that this may be related to the WAL bugs that Keith
>>>> fixed for 1.9.1... which could affect the metadata table recovery after a
>>>> failure.
>>>>
>>>> On Fri, May 11, 2018 at 6:11 PM Michael Wall <mjw...@gmail.com> wrote:
>>>>
>>>>> Adam,
>>>>>
>>>>> Do you have GC logs?  Can you see if those missing RFiles were removed
>>>>> by the GC process?  That could indicate you somehow got old metadata info
>>>>> replayed.  Also, the rfile names increment, so compare the current rfile
>>>>> names in the srv.dir directory vs. what is in the metadata table.  Are the
>>>>> existing files after the files in the metadata?  Finally, pick a few of
>>>>> the missing files and grep all your master and tserver logs to see if you
>>>>> can learn anything.  This sounds ungood.
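>>>>>
>>>>> One quick way to do that comparison, e.g. (the table id and paths below
>>>>> are placeholders):
>>>>>
>>>>>   hdfs dfs -ls -R /accumulo/tables/3x | grep '\.rf$'
>>>>>   # and in the accumulo shell, list the file entries for that table:
>>>>>   scan -t accumulo.metadata -b "3x;" -e "3x<" -c file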
>>>>>
>>>>> Mike
>>>>>
>>>>> On Fri, May 11, 2018 at 6:06 PM Christopher <ctubb...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> This is strange. I've only ever seen this when HDFS has reported
>>>>>> problems, such as missing blocks, or another obvious failure. What are
>>>>>> your durability settings (were WALs turned on)?
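>>>>>>
>>>>>> The per-table setting can be checked in the shell with something along
>>>>>> these lines (the table name is a placeholder):
>>>>>>
>>>>>>   config -t <tablename> -f table.durability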
>>>>>>
>>>>>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook <adamjsh...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> On one of our clusters, there are a good number of missing RFiles
>>>>>>> from HDFS; however, HDFS is not reporting and has not reported any
>>>>>>> missing blocks.  We were experiencing issues with HDFS; some flapping
>>>>>>> DataNode processes that needed more heap.
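>>>>>>>
>>>>>>> A quick way to confirm HDFS's view is the fsck summary, e.g. something
>>>>>>> like the following (the path is a placeholder):
>>>>>>>
>>>>>>>   hdfs fsck /accumulo | tail -n 30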
>>>>>>>
>>>>>>> I don't anticipate I can do much besides create a bunch of empty
>>>>>>> RFiles (open to suggestions).  My question is: is it possible that
>>>>>>> Accumulo could have written the metadata for these RFiles but failed to
>>>>>>> write them to HDFS?  In which case, would the writes have been retried
>>>>>>> later and the data persisted to a different RFile?  Or is it an 'RFile
>>>>>>> is in Accumulo metadata if and only if it is in HDFS' situation?
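>>>>>>>
>>>>>>> If it does come down to creating empty RFiles, my understanding is the
>>>>>>> CreateEmpty utility can be pointed at the missing paths, roughly like
>>>>>>> this (the path below is a placeholder):
>>>>>>>
>>>>>>>   accumulo org.apache.accumulo.core.file.rfile.CreateEmpty \
>>>>>>>       /accumulo/tables/3x/default_tablet/F0000abc.rf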
>>>>>>>
>>>>>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> --Adam
>>>>>>>
>>>>>>
>>>
>
