Since the rfiles on disk are "later" than the ones referenced, I tend to think old metadata got rewritten. Since you can't get a timeline to better understand what happened, the only thing I can think of is to reingest all data since a known good point. And then do things to make the future better, like tweaking which logs you save and upgrading to 1.9.1. Sorry, I wish I had better answers for you.
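For anyone who wants to script the comparison Mike suggested earlier in the thread (rfiles referenced in the metadata table vs. what is actually on HDFS), here is a rough sketch against the 1.8 client API. The instance name, zookeepers, credentials, and table ID are placeholders, so treat it as a starting point rather than a drop-in tool:

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class ListReferencedRFiles {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your own
        Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
            .getConnector("root", new PasswordToken("secret"));

        Scanner scan = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
        // Metadata rows for a table start with its table ID; "3x" is a placeholder
        scan.setRange(Range.prefix("3x"));
        // File references live in the "file" column family
        scan.fetchColumnFamily(new Text("file"));

        for (Map.Entry<Key,Value> e : scan) {
          // The column qualifier holds the rfile path
          System.out.println(e.getKey().getColumnQualifier());
        }
      }
    }

Diffing that output against `hadoop fs -ls -R` on the table's directory should show exactly which referenced rfiles are gone.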
On Wed, May 16, 2018 at 11:25 AM Adam J. Shook <adamjsh...@gmail.com> wrote:

> I tried building a timeline, but the logs are just not there. We weren't
> sending the debug logs to Splunk due to the verbosity, but we may be
> tweaking the log4j settings a bit to make sure we get the log data stored
> in the event this happens again. This very well could be attributed to the
> recovery failure; hard to say. I'll be upgrading to 1.9.1 soon.
>
> On Mon, May 14, 2018 at 8:53 AM, Michael Wall <mjw...@gmail.com> wrote:
>
>> Can you pick some of the files that are missing and search through your
>> logs to put together a timeline? See if you can find that file for a
>> specific tablet. Then grab all the logs for when a file was created as a
>> result of a compaction, and when a file was included in a compaction for
>> that table. Follow compactions for that tablet until you started getting
>> errors. Then see what logs you have for WAL replay during that time for
>> that tablet and the metadata, and try to correlate.
>>
>> It's a shame you don't have the GC logs. If you saw it was GC'd and then
>> showed up in the metadata table again, that would help explain what
>> happened. Like Christopher mentioned, this could be related to a recovery
>> failure.
>>
>> Mike
>>
>> On Sat, May 12, 2018 at 5:26 PM Adam J. Shook <adamjsh...@gmail.com>
>> wrote:
>>
>>> WALs are turned on. Durability is set to flush for all tables except
>>> for root and metadata, which are sync. The current rfile names on HDFS
>>> and in the metadata table are greater than the files that are missing.
>>> I searched through all of our current and historical logs in Splunk
>>> (which are only INFO level or higher). Issues from the logs:
>>>
>>> * Problem reports saying the files are not found
>>> * IllegalStateException saying the rfile is closed when it tried to
>>> load the Bloom filter (likely the flappy DataNode)
>>> * IOException when reading the file saying the stream is closed (likely
>>> the flappy DataNode)
>>>
>>> Nothing in the GC logs -- all the above errors are in the tablet server
>>> logs. The logs may have rolled over, though, and our debug logs don't
>>> make it into Splunk.
>>>
>>> --Adam
>>>
>>> On Fri, May 11, 2018 at 6:16 PM, Christopher <ctubb...@apache.org>
>>> wrote:
>>>
>>>> Oh, it occurs to me that this may be related to the WAL bugs that
>>>> Keith fixed for 1.9.1... which could affect metadata table recovery
>>>> after a failure.
>>>>
>>>> On Fri, May 11, 2018 at 6:11 PM Michael Wall <mjw...@gmail.com> wrote:
>>>>
>>>>> Adam,
>>>>>
>>>>> Do you have GC logs? Can you see if those missing RFiles were removed
>>>>> by the GC process? That could indicate you somehow got old metadata
>>>>> info replayed. Also, the rfile names increment, so compare the
>>>>> current rfile names in the srv.dir directory vs what is in the
>>>>> metadata table. Are the existing files after the files in the
>>>>> metadata? Finally, pick a few of the missing files and grep all your
>>>>> master and tserver logs to see if you can learn anything. This sounds
>>>>> ungood.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Fri, May 11, 2018 at 6:06 PM Christopher <ctubb...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> This is strange. I've only ever seen this when HDFS has reported
>>>>>> problems, such as missing blocks, or another obvious failure. What
>>>>>> are your durability settings (were WALs turned on)?
>>>>>>
>>>>>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook
>>>>>> <adamjsh...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> On one of our clusters, there are a good number of missing RFiles
>>>>>>> from HDFS; however, HDFS has not reported any missing blocks. We
>>>>>>> were experiencing issues with HDFS -- some flapping DataNode
>>>>>>> processes that needed more heap.
>>>>>>>
>>>>>>> I don't anticipate I can do much besides create a bunch of empty
>>>>>>> RFiles (open to suggestions). My question is: is it possible that
>>>>>>> Accumulo could have written the metadata for these RFiles but
>>>>>>> failed to write them to HDFS? In which case the write would have
>>>>>>> been retried later and the data persisted to a different RFile? Or
>>>>>>> is it an 'RFile is in Accumulo metadata if and only if it is in
>>>>>>> HDFS' situation?
>>>>>>>
>>>>>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> --Adam
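Regarding the empty-RFile workaround raised in the original message: if I recall correctly, Accumulo ships a utility for exactly this (run as `accumulo org.apache.accumulo.core.file.rfile.CreateEmpty <paths>`), and the public RFile API added in 1.8 can do the same thing programmatically. A minimal sketch, with the paths purely as placeholders:

    import org.apache.accumulo.core.client.rfile.RFile;
    import org.apache.accumulo.core.client.rfile.RFileWriter;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class CreateEmptyRFiles {
      public static void main(String[] args) throws Exception {
        // Assumes the Hadoop configs on the classpath point at the right HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        for (String path : args) {
          // e.g. hdfs://namenode/accumulo/tables/3x/t-0000abc/F0000def.rf (placeholder)
          try (RFileWriter writer = RFile.newWriter().to(path).withFileSystem(fs).build()) {
            // Open and immediately close: a valid rfile with zero entries
            writer.startDefaultLocalityGroup();
          }
        }
      }
    }

Note that this only stops scans from failing; whatever data lived only in the missing rfiles is gone, so reingesting from a known good point, as suggested above, is still the real fix.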