Thanks for all of your help. We have a peer cluster that we'll be using to do some data reconciliation.
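As a concrete illustration of Mike's suggestion below (grep all the logs for each missing file and build a timeline), here is a minimal sketch. The file name, log paths, and log lines are invented stand-ins, not real Accumulo output:

```shell
# Hypothetical demo: gather every log line mentioning a missing rfile,
# across all server logs, merged in time order. Log contents are made up.
rfile='F0000b2.rf'
mkdir -p demo_logs
cat > demo_logs/tserver_1.log <<'EOF'
2018-05-01 10:00:01 INFO compaction wrote F0000b2.rf
2018-05-03 14:02:55 ERROR FileNotFoundException: F0000b2.rf
EOF
cat > demo_logs/master_1.log <<'EOF'
2018-05-02 09:30:12 INFO F0000b2.rf used in major compaction
EOF
# -h suppresses file names so the timestamp-prefixed lines sort cleanly
grep -h "$rfile" demo_logs/*.log | sort
```

Because the lines begin with timestamps, a plain `sort` yields the timeline: creation, use in a compaction, then the first error.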
On Wed, May 16, 2018 at 11:29 AM, Michael Wall <mjw...@gmail.com> wrote:

> Since the rfiles on disk are "later" than the ones referenced, I tend to
> think old metadata got rewritten. Since you can't get a timeline to better
> understand what happened, the only thing I can think of is to reingest all
> data since a known good point, and then do things to make the future
> better, like tweaking which logs you save and upgrading to 1.9.1. Sorry, I
> wish I had better answers for you.
>
> On Wed, May 16, 2018 at 11:25 AM Adam J. Shook <adamjsh...@gmail.com> wrote:
>
>> I tried building a timeline, but the logs are just not there. We weren't
>> sending the debug logs to Splunk due to the verbosity, but we may be
>> tweaking the log4j settings a bit to make sure we get the log data stored
>> in the event this happens again. This very well could be attributed to
>> the recovery failure; hard to say. I'll be upgrading to 1.9.1 soon.
>>
>> On Mon, May 14, 2018 at 8:53 AM, Michael Wall <mjw...@gmail.com> wrote:
>>
>>> Can you pick some of the files that are missing and search through your
>>> logs to put together a timeline? See if you can find that file for a
>>> specific tablet. Then grab all the logs for when a file was created as a
>>> result of a compaction, and when a file was included in a compaction for
>>> that table. Follow compactions for that tablet until you started getting
>>> errors. Then see what logs you have for WAL replay during that time for
>>> that tablet and the metadata, and try to correlate.
>>>
>>> It's a shame you don't have the GC logs. If you saw the file was GC'd
>>> and then showed up in the metadata table again, that would help explain
>>> what happened. Like Christopher mentioned, this could be related to a
>>> recovery failure.
>>>
>>> Mike
>>>
>>> On Sat, May 12, 2018 at 5:26 PM Adam J. Shook <adamjsh...@gmail.com> wrote:
>>>
>>>> WALs are turned on. Durability is set to flush for all tables except
>>>> for root and metadata, which are sync.
>>>> The current rfile names on HDFS and in the metadata table are greater
>>>> than the files that are missing. I searched through all of our current
>>>> and historical logs in Splunk (which are only INFO level or higher).
>>>> Issues from the logs:
>>>>
>>>> * Problem reports saying the files are not found
>>>> * IllegalStateException saying the rfile is closed when it tried to
>>>>   load the Bloom filter (likely the flappy DataNode)
>>>> * IOException when reading the file saying the stream is closed (likely
>>>>   the flappy DataNode)
>>>>
>>>> Nothing in the GC logs -- all the above errors are in the tablet server
>>>> logs. The logs may have rolled over, though, and our debug logs don't
>>>> make it into Splunk.
>>>>
>>>> --Adam
>>>>
>>>> On Fri, May 11, 2018 at 6:16 PM, Christopher <ctubb...@apache.org> wrote:
>>>>
>>>>> Oh, it occurs to me that this may be related to the WAL bugs that
>>>>> Keith fixed for 1.9.1... which could affect metadata table recovery
>>>>> after a failure.
>>>>>
>>>>> On Fri, May 11, 2018 at 6:11 PM Michael Wall <mjw...@gmail.com> wrote:
>>>>>
>>>>>> Adam,
>>>>>>
>>>>>> Do you have GC logs? Can you see if those missing RFiles were removed
>>>>>> by the GC process? That could indicate you somehow got old metadata
>>>>>> info replayed. Also, the rfiles increment, so compare the current
>>>>>> rfile names in the srv.dir directory with what is in the metadata
>>>>>> table. Are the existing files after the files in the metadata?
>>>>>> Finally, pick a few of the missing files and grep all your master and
>>>>>> tserver logs to see if you can learn anything. This sounds ungood.
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Fri, May 11, 2018 at 6:06 PM Christopher <ctubb...@apache.org> wrote:
>>>>>>
>>>>>>> This is strange. I've only ever seen this when HDFS has reported
>>>>>>> problems, such as missing blocks, or another obvious failure. What
>>>>>>> are your durability settings (were WALs turned on)?
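Mike's point above about rfile names incrementing can be checked mechanically. A rough sketch with invented names (in practice the newest name on disk would come from listing the tablet's directory in HDFS, and the missing names from the problem reports):

```shell
# Hypothetical "do surviving rfiles sort after the missing ones?" check.
# Names are stand-ins, not real output.
missing='F0000b2.rf'
newest_on_disk='F0000c3.rf'
# rfile names are assigned in increasing order, so a lexicographic
# comparison approximates creation order
if [ "$newest_on_disk" \> "$missing" ]; then
  echo "files on disk postdate the missing file: consistent with stale metadata replay"
fi
```

That is the signal discussed in the thread: existing files sorting after the missing ones suggests the missing entries reappeared from old metadata rather than the files being newer than the metadata.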
>>>>>>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook <adamjsh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> On one of our clusters, there are a good number of missing RFiles
>>>>>>>> from HDFS; however, HDFS is not reporting and has not reported any
>>>>>>>> missing blocks. We were experiencing issues with HDFS -- some
>>>>>>>> flapping DataNode processes that needed more heap.
>>>>>>>>
>>>>>>>> I don't anticipate I can do much besides create a bunch of empty
>>>>>>>> RFiles (open to suggestions). My question is: is it possible that
>>>>>>>> Accumulo could have written the metadata for these RFiles but
>>>>>>>> failed to write them to HDFS, in which case the write would have
>>>>>>>> been retried later and the data persisted to a different RFile? Or
>>>>>>>> is it an 'RFile is in Accumulo metadata if and only if it is in
>>>>>>>> HDFS' situation?
>>>>>>>>
>>>>>>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> --Adam
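One way to enumerate exactly which referenced files are gone is to diff the two lists. This is a sketch only, with stand-in file names; in practice the first list would come from a scan of the metadata table's file column and the second from a recursive HDFS listing of the tables directory:

```shell
# Stand-in lists; real ones would come from the metadata table and HDFS.
printf 'F0000a1.rf\nF0000b2.rf\nF0000c3.rf\n' | sort > meta_files.txt
printf 'F0000a1.rf\nF0000c3.rf\n' | sort > hdfs_files.txt
# comm -23: lines only in the first file, i.e. referenced but not on disk
comm -23 meta_files.txt hdfs_files.txt
# -> F0000b2.rf
```

For the empty-RFile workaround Adam mentions, the 1.x releases ship a `org.apache.accumulo.core.file.rfile.CreateEmpty` utility intended for exactly this kind of recovery, though its invocation should be checked against the docs for the version in use.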