Thanks for all of your help.  We have a peer cluster that we'll be using to
do some data reconciliation.

On Wed, May 16, 2018 at 11:29 AM, Michael Wall <mjw...@gmail.com> wrote:

> Since the rfiles on disk are "later" than the ones referenced, I tend to
> think old metadata got rewritten.  Since you can't get a timeline to better
> understand what happened, the only thing I can think of is to reingest all
> data since a known good point, and then do things to make the future better,
> like tweaking which logs you save and upgrading to 1.9.1.  Sorry, I wish I
> had better answers for you.
>
>
> On Wed, May 16, 2018 at 11:25 AM Adam J. Shook <adamjsh...@gmail.com>
> wrote:
>
>> I tried building a timeline but the logs are just not there.  We weren't
>> sending the debug logs to Splunk due to the verbosity, but we may be
>> tweaking the log4j settings a bit to make sure we get the log data stored
>> in the event this happens again.  This very well could be attributed to the
>> recovery failure; hard to say.  I'll be upgrading to 1.9.1 soon.
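>>
>> Something like this log4j sketch is what we're considering -- keeping
>> DEBUG on local disk with a rolling appender even though only INFO and
>> above ships to Splunk.  Paths and sizes below are placeholders, assuming
>> the stock log4j 1.2 setup that Accumulo 1.8 uses:
>>
>> # Placeholder sketch: local rolling DEBUG log, independent of Splunk.
>> log4j.rootLogger=DEBUG, debuglog
>> log4j.appender.debuglog=org.apache.log4j.RollingFileAppender
>> log4j.appender.debuglog.File=/var/log/accumulo/tserver-debug.log
>> log4j.appender.debuglog.MaxFileSize=512MB
>> log4j.appender.debuglog.MaxBackupIndex=10
>> log4j.appender.debuglog.layout=org.apache.log4j.PatternLayout
>> log4j.appender.debuglog.layout.ConversionPattern=%d{ISO8601} [%c] %-5p: %m%n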
>>
>> On Mon, May 14, 2018 at 8:53 AM, Michael Wall <mjw...@gmail.com> wrote:
>>
>>> Can you pick some of the files that are missing and search through your
>>> logs to put together a timeline?  See if you can find that file for a
>>> specific tablet.  Then grab all the logs for when a file was created as a
>>> result of a compaction, and when a file was included in a compaction for
>>> that tablet.  Follow compactions for that tablet until you started getting
>>> errors.  Then see what logs you have for WAL replay during that time for
>>> that tablet and the metadata table, and try to correlate.
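>>>
>>> If it helps, here's a rough sketch of what I mean in plain Java -- pull
>>> every log line that mentions one rfile and sort on the leading timestamp
>>> (the log directory and file name are placeholders):
>>>
>>> import java.io.IOException;
>>> import java.nio.file.Files;
>>> import java.nio.file.Path;
>>> import java.nio.file.Paths;
>>> import java.util.stream.Stream;
>>>
>>> // Rough sketch: collect every log line mentioning one rfile so the
>>> // compaction/replay events can be ordered into a timeline.
>>> public class RFileTimeline {
>>>     public static void main(String[] args) throws IOException {
>>>         Path logDir = Paths.get("/var/log/accumulo"); // placeholder
>>>         String rfile = "A0001234.rf";                 // placeholder
>>>         try (Stream<Path> logs = Files.walk(logDir)) {
>>>             logs.filter(Files::isRegularFile)
>>>                 .flatMap(RFileTimeline::readLines)
>>>                 .filter(line -> line.contains(rfile))
>>>                 .sorted() // log4j lines lead with a timestamp
>>>                 .forEach(System.out::println);
>>>         }
>>>     }
>>>
>>>     private static Stream<String> readLines(Path p) {
>>>         try {
>>>             return Files.readAllLines(p).stream();
>>>         } catch (IOException e) {
>>>             return Stream.empty(); // skip unreadable files
>>>         }
>>>     }
>>> }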
>>>
>>> It's a shame you don't have the GC logs.  If you saw a file get GC'd and
>>> then show up in the metadata table again, that would help explain what
>>> happened.  Like Christopher mentioned, this could be related to a recovery
>>> failure.
>>>
>>> Mike
>>>
>>> On Sat, May 12, 2018 at 5:26 PM Adam J. Shook <adamjsh...@gmail.com>
>>> wrote:
>>>
>>>> WALs are turned on.  Durability is set to flush for all tables except
>>>> for root and metadata, which are sync.  The current rfile names on HDFS
>>>> and in the metadata table are greater than the files that are missing.
>>>> I searched through all of our current and historical logs in Splunk
>>>> (which are only INFO level or higher).  Issues from the logs:
>>>>
>>>> * Problem reports saying the files are not found
>>>> * IllegalStateException saying the rfile is closed when it tried to
>>>> load the Bloom filter (likely the flappy DataNode)
>>>> * IOException when reading the file saying Stream is closed (likely the
>>>> flappy DataNode)
>>>>
>>>> Nothing in the GC logs -- all the above errors are in the tablet server
>>>> logs.  The logs may have rolled over, though, and our debug logs don't make
>>>> it into Splunk.
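>>>>
>>>> For reference, this is roughly how I checked the durability settings
>>>> through the client API (the instance name, ZooKeepers, and credentials
>>>> are placeholders):
>>>>
>>>> import java.util.Map.Entry;
>>>> import org.apache.accumulo.core.client.Connector;
>>>> import org.apache.accumulo.core.client.ZooKeeperInstance;
>>>> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
>>>>
>>>> // Rough sketch: print table.durability for every table to confirm
>>>> // which tables flush vs sync.
>>>> public class CheckDurability {
>>>>     public static void main(String[] args) throws Exception {
>>>>         Connector conn = new ZooKeeperInstance("instance", "zk1:2181") // placeholders
>>>>                 .getConnector("root", new PasswordToken("secret"));    // placeholders
>>>>         for (String table : conn.tableOperations().list()) {
>>>>             for (Entry<String, String> prop : conn.tableOperations().getProperties(table)) {
>>>>                 if (prop.getKey().equals("table.durability")) {
>>>>                     System.out.println(table + " -> " + prop.getValue());
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>> }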
>>>>
>>>> --Adam
>>>>
>>>> On Fri, May 11, 2018 at 6:16 PM, Christopher <ctubb...@apache.org>
>>>> wrote:
>>>>
>>>>> Oh, it occurs to me that this may be related to the WAL bugs that
>>>>> Keith fixed for 1.9.1... which could affect the metadata table recovery
>>>>> after a failure.
>>>>>
>>>>> On Fri, May 11, 2018 at 6:11 PM Michael Wall <mjw...@gmail.com> wrote:
>>>>>
>>>>>> Adam,
>>>>>>
>>>>>> Do you have GC logs?  Can you see if those missing RFiles were
>>>>>> removed by the GC process?  That could indicate you somehow got old
>>>>>> metadata info replayed.  Also, the rfile names increment, so compare
>>>>>> the current rfile names in the srv.dir directory vs what is in the
>>>>>> metadata table.  Are the existing files after the files in the
>>>>>> metadata?  Finally, pick a few of the missing files and grep all your
>>>>>> master and tserver logs to see if you can learn anything.  This sounds
>>>>>> ungood.
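>>>>>>
>>>>>> A rough sketch of the metadata side of that comparison, assuming a
>>>>>> placeholder table ID of "3" -- the file column qualifiers it prints
>>>>>> are the paths to diff against an HDFS listing:
>>>>>>
>>>>>> import java.util.Map.Entry;
>>>>>> import org.apache.accumulo.core.client.Connector;
>>>>>> import org.apache.accumulo.core.client.Scanner;
>>>>>> import org.apache.accumulo.core.client.ZooKeeperInstance;
>>>>>> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
>>>>>> import org.apache.accumulo.core.data.Key;
>>>>>> import org.apache.accumulo.core.data.Range;
>>>>>> import org.apache.accumulo.core.data.Value;
>>>>>> import org.apache.accumulo.core.security.Authorizations;
>>>>>> import org.apache.hadoop.io.Text;
>>>>>>
>>>>>> // Rough sketch: print every rfile path the metadata table references
>>>>>> // for one table, to compare against what actually exists in HDFS.
>>>>>> public class MetadataFileList {
>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>         Connector conn = new ZooKeeperInstance("instance", "zk1:2181") // placeholders
>>>>>>                 .getConnector("root", new PasswordToken("secret"));    // placeholders
>>>>>>         Scanner scan = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
>>>>>>         scan.setRange(new Range("3;", "3<"));     // tablet rows for table ID "3"
>>>>>>         scan.fetchColumnFamily(new Text("file")); // one entry per rfile per tablet
>>>>>>         for (Entry<Key, Value> e : scan) {
>>>>>>             System.out.println(e.getKey().getColumnQualifier());
>>>>>>         }
>>>>>>     }
>>>>>> }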
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Fri, May 11, 2018 at 6:06 PM Christopher <ctubb...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> This is strange. I've only ever seen this when HDFS has reported
>>>>>>> problems, such as missing blocks, or another obvious failure. What are
>>>>>>> your durability settings (were WALs turned on)?
>>>>>>>
>>>>>>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook <adamjsh...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> On one of our clusters, there are a good number of RFiles missing
>>>>>>>> from HDFS; however, HDFS is not reporting and has not reported any
>>>>>>>> missing blocks.  We were experiencing issues with HDFS; some flapping
>>>>>>>> DataNode processes that needed more heap.
>>>>>>>>
>>>>>>>> I don't anticipate I can do much besides create a bunch of empty
>>>>>>>> RFiles (open to suggestions).  My question is: is it possible that
>>>>>>>> Accumulo could have written the metadata for these RFiles but failed
>>>>>>>> to write the files to HDFS, in which case the writes would have been
>>>>>>>> retried later and the data persisted to different RFiles?  Or is it
>>>>>>>> an 'RFile is in Accumulo metadata if and only if it is in HDFS'
>>>>>>>> situation?
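>>>>>>>>
>>>>>>>> If I do go the empty-file route, I'm picturing something like this
>>>>>>>> sketch with the 1.8 RFile API -- it would stop the file-not-found
>>>>>>>> errors but obviously won't bring the data back (the path is a
>>>>>>>> placeholder):
>>>>>>>>
>>>>>>>> import org.apache.accumulo.core.client.rfile.RFile;
>>>>>>>> import org.apache.accumulo.core.client.rfile.RFileWriter;
>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>>>
>>>>>>>> // Rough sketch: write a zero-entry rfile at the exact path the
>>>>>>>> // metadata references, so tablets stop erroring on the missing file.
>>>>>>>> public class EmptyRFile {
>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>         FileSystem fs = FileSystem.get(new Configuration());
>>>>>>>>         String path = "/accumulo/tables/3/t-0001234/A0005678.rf"; // placeholder
>>>>>>>>         try (RFileWriter writer = RFile.newWriter().to(path).withFileSystem(fs).build()) {
>>>>>>>>             // intentionally write nothing -- empty but valid rfile
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>> }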
>>>>>>>>
>>>>>>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> --Adam
>>>>>>>>
>>>>>>>
>>>>
>>
