Hi Nick,

Looks like you are on the right track with the recovery files, which are created when WALs have to be replayed; see http://accumulo.apache.org/1.8/accumulo_user_manual.html#_recovery.

Maybe try deleting hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished and give it 10 min or so.

It could be that one or both of those part files are bad, so your next step could be to remove the hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/ directory entirely. Again, give Accumulo 10 min or more.

I don't recall how to track 8bd07d5c-710f-4072-b351-8ce09d771237 back to the WALs. Maybe look for the first occurrence of that ID in the logs to see whether the WALs are still there. If they are not, maybe move hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/ somewhere else instead of deleting it, so that you keep a copy of the sorted recovery data. If you can figure out what is in those WALs, you will know what to replay.
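In case it helps, here is a rough sketch of that order of operations as a small Python script. It only prints the `hdfs dfs` commands, so nothing is changed until you run them yourself, and it includes a helper for finding the first log line that mentions the recovery ID. The backup location (`BACKUP_ROOT`) and the function names are made up for illustration; pick whatever suits your cluster, and treat this as a starting point rather than a tested procedure.

```python
import shlex

# Paths taken from the tserver log in this thread.
RECOVERY_ROOT = "hdfs://master01:9000/user/accumulo/accumulo/recovery"
RECOVERY_ID = "8bd07d5c-710f-4072-b351-8ce09d771237"
# Hypothetical location to park the directory in; not from the original message.
BACKUP_ROOT = "hdfs://master01:9000/user/accumulo/recovery-backup"


def first_mention(lines, recovery_id):
    """Return (line_number, line) for the first line containing recovery_id, or None."""
    for number, line in enumerate(lines, start=1):
        if recovery_id in line:
            return number, line.rstrip("\n")
    return None


def recovery_commands(root, recovery_id, backup_root):
    """Build the HDFS shell commands as argument lists, least destructive first."""
    recovery_dir = f"{root}/{recovery_id}"
    return [
        # Step 0: look at what is there before touching anything.
        ["hdfs", "dfs", "-ls", "-R", recovery_dir],
        # Step 1: remove only the zero-byte 'finished' marker, then wait ~10 min.
        ["hdfs", "dfs", "-rm", f"{recovery_dir}/finished"],
        # Step 2: if tablets still fail to load, move the whole directory aside
        # (rather than deleting it) so the part files can still be inspected.
        ["hdfs", "dfs", "-mkdir", "-p", backup_root],
        ["hdfs", "dfs", "-mv", recovery_dir, f"{backup_root}/{recovery_id}"],
    ]


if __name__ == "__main__":
    # Print the commands for review; nothing is executed here.
    for cmd in recovery_commands(RECOVERY_ROOT, RECOVERY_ID, BACKUP_ROOT):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

To use `first_mention`, open the master and tserver logs oldest first and pass the file object in, for example `first_mention(open("/usr/local/accumulo/logs/master_master02.log"), RECOVERY_ID)`. Run step 1 on its own first and only go on to step 2 if the assignment failures continue after the wait.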
Good luck Mike On Sat, Mar 17, 2018 at 7:14 PM Nick Wise <nicholas.w...@sa.catapult.org.uk> wrote: > Hello, > > > > I’m seeing a lot of errors such as the following across my production > cluster, which has 30 nodes and is running Accumulo 1.7 on Hadoop 2.7.1. > The system has been running for many months without error. I would > appreciate any guidance that can be given particularly if I should, or > indeed should not, stop the cluster in order to resolve. > > > > There are over 60 billion elements in the table associated with this > tablet, and rebuilding from scratch would be very difficult. I can stand > some data loss and re-ingest recent data, if that restores service. > > > > From /usr/local/accumulo/logs/master_master02.log: > > > > 2018-03-17 22:02:16,527 [master.Master] ERROR: node11:9997 reports > assignment failed for tablet > i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 > > 2018-03-17 22:02:16,560 [master.Master] ERROR: node25:9997 reports > assignment failed for tablet > i7;^A^@^C51008b0d-1fc7-4742-bc4f-67ec280c7ebc^@80000152d;^A^@^C50fb7a13-1f0f-4943-a94f-26fcd8d15439^@8000014d1e > > 2018-03-17 22:02:16,574 [master.Master] ERROR: node30:9997 reports > assignment failed for tablet > i7;^A^@^C52d07ff3-677b-44b2-bef5-23d027946401^@80000159fc;^A^@^C52cc4173-6e57-4ef0-81a7-879e77a7d820^@80000152 > > 2018-03-17 22:02:16,586 [master.Master] ERROR: node06:9997 reports > assignment failed for tablet > i6;02~1~posDataFeature~gcn~20170228;02~1~posDataFeature~gbv~201608 > > 2018-03-17 22:02:16,589 [master.Master] ERROR: node16:9997 reports > assignment failed for tablet > i6;00~1~posDataFeature~tbr~2016081;00~1~posDataFeature~t9w~201604 > > 2018-03-17 22:02:16,616 [master.Master] ERROR: node26:9997 reports > assignment failed for tablet > i6;02~1~posDataFeature~dxf~201607;02~1~posDataFeature~dvt~201602 > > 2018-03-17 22:02:16,694 [master.Master] ERROR: node17:9997 reports > assignment failed for tablet > 
i6;17~0~posDataFeature~k42~201505;17~0~posDataFeature~k3u~20160818 > > 2018-03-17 22:02:16,778 [master.Master] ERROR: node07:9997 reports > assignment failed for tablet > i6;10~0~posDataFeature~u33~2017122;10~0~posDataFeature~u1x~20160207 > > 2018-03-17 22:02:16,810 [master.Master] ERROR: node05:9997 reports > assignment failed for tablet > i6;10~1~posDataFeature~rqe~20140805;10~1~posDataFeature~rqc~20160309 > > 2018-03-17 22:02:16,825 [master.Master] ERROR: node18:9997 reports > assignment failed for tablet > i6;06~1~posDataFeature~6pn~2015082;06~1~posDataFeature~6nx~20170514 > > 2018-03-17 22:02:16,827 [master.Master] ERROR: node33:9997 reports > assignment failed for tablet > i6;09~1~posDataFeature~fu4~20170624;09~1~posDataFeature~fgc~20160928 > > 2018-03-17 22:02:16,859 [master.Master] ERROR: node31:9997 reports > assignment failed for tablet > i7;^A^@^Cc5e97d3d-f3d0-4c80-acfb-b4de1d2aaa0e^@8000014a62;^A^@^Cc5e44ab1-a0a1-4d9c-abce-250a27209c15^@80000156fd > > 2018-03-17 22:02:16,920 [master.Master] ERROR: node14:9997 reports > assignment failed for tablet > i6;22~0~posDataFeature~xqs~2016;22~0~posDataFeature~xn5~2016012 > > 2018-03-17 22:02:16,938 [master.Master] ERROR: node22:9997 reports > assignment failed for tablet > i6;23~1~posDataFeature~kdm~2015013;23~1~posDataFeature~kdh~20160708 > > 2018-03-17 22:02:16,981 [master.Master] ERROR: node15:9997 reports > assignment failed for tablet > i6;07~1~posDataFeature~w7e~2016092;07~1~posDataFeature~w49 > > 2018-03-17 22:02:17,194 [master.Master] ERROR: node21:9997 reports > assignment failed for tablet > i7;^A^@^C92dba50b-219b-46d0-932a-a12a58ca830f^@800001523;^A^@^C92da7a96-7011-41d0-82ea-1a39ae7b1a6d^@8000015959 > > 2018-03-17 22:02:17,209 [master.Master] ERROR: node13:9997 reports > assignment failed for tablet > i7;^A^@^Cc05;^A^@^Cc04d6645-9a4b-42b8-a8cc-58ff19e0b957^@800001465 > > 2018-03-17 22:02:17,574 [master.Master] ERROR: node30:9997 reports > assignment failed for tablet > 
i6;17~1~posDataFeature~e7;17~1~posDataFeature~e4r~20160509 > > 2018-03-17 22:02:17,586 [master.Master] ERROR: node06:9997 reports > assignment failed for tablet > i6;01~1~posDataFeature~d3v~20160222;01~1~posDataFeature~d3f~201707072 > > 2018-03-17 22:02:17,590 [master.Master] ERROR: node16:9997 reports > assignment failed for tablet > i7;^A^@^C88fe01d0-243b-487c-ab0d-30b02e8ccf69^@80000152c;^A^@^C88f2eba5-9c1f-4230-8dd2-d1a9f0f659bd^@8000015d > > 2018-03-17 22:02:17,617 [master.Master] ERROR: node26:9997 reports > assignment failed for tablet > i6;05~1~posDataFeature~d1f;05~1~posDataFeature~d0t~201507 > > 2018-03-17 22:02:17,694 [master.Master] ERROR: node17:9997 reports > assignment failed for tablet > i7;^A^@^C3df85419-e448-4fa3-87b3-ec95edae204b^@80000153;^A^@^C3df47e59-05af-43e9-a650-50685f66ec0e^@80000153d > > 2018-03-17 22:02:17,827 [master.Master] ERROR: node33:9997 reports > assignment failed for tablet > i7;^A^@^C50f11344-0f80-435e-a6fa-7312619e1535^@80000142b8;^A^@^C50eb8126-a75c-4d89-9767-2a6a71c6bfac^@800001552a > > 2018-03-17 22:02:17,859 [master.Master] ERROR: node31:9997 reports > assignment failed for tablet > i6;05~0~posDataFeature~pz8;05~0~posDataFeature~my0~20141 > > 2018-03-17 22:02:17,920 [master.Master] ERROR: node14:9997 reports > assignment failed for tablet > i6;14~1~posDataFeature~7ej~201606;14~1~posDataFeature~7ds~201608 > > 2018-03-17 22:02:17,938 [master.Master] ERROR: node22:9997 reports > assignment failed for tablet > i6;18~1~posDataFeature~u3f~2016031;18~1~posDataFeature~u3b~20150816 > > 2018-03-17 22:02:17,981 [master.Master] ERROR: node15:9997 reports > assignment failed for tablet > i6;14~0~posDataFeature~wtq~20151019;14~0~posDataFeature~wt3~20170605 > > 2018-03-17 22:02:18,194 [master.Master] ERROR: node21:9997 reports > assignment failed for tablet > i6;11~0~posDataFeature~dyq~201607;11~0~posDataFeature~drm~201507 > > > > From /usr/local/accumulo/logs/ tserver_node11.log > > > > 2018-03-17 22:02:16,516 [tserver.TabletServer] 
INFO : adding tablet > i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 back to the > assignment pool (retry 129) > > 2018-03-17 22:02:16,517 [tserver.TabletServer] INFO : node11:9997: got > assignment from master: > i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 > > 2018-03-17 22:02:16,521 [tablet.Tablet] INFO : Starting Write-Ahead Log > recovery for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 > > 2018-03-17 22:02:16,521 [tserver.TabletServer] INFO : Looking for > hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished > > 2018-03-17 22:02:16,521 [log.SortedLogRecovery] INFO : Looking at > mutations from > hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 > for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 > > 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : exception trying to > assign tablet > i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 > hdfs://master01:9000/user/accumulo/accumulo/tables/i6/t-011gdek > > java.lang.RuntimeException: java.io.IOException: java.io.EOFException > > at > org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:639) > > at > org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449) > > at > org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2157) > > at > org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35) > > at > org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61) > > at > org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at > org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: 
java.io.IOException: java.io.EOFException > > at > org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:456) > > at > org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012) > > at > org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:589) > > ... 9 more > > Caused by: java.io.EOFException > > at java.io.DataInputStream.readFully(DataInputStream.java:197) > > at java.io.DataInputStream.readFully(DataInputStream.java:169) > > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848) > > at > org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813) > > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762) > > at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:443) > > at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399) > > at > org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113) > > at > org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105) > > at > org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:454) > > ... 11 more > > 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : java.io.IOException: > java.io.EOFException > > 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : failed to open > tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 > reporting failure to master > > 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : rescheduling tablet > load in 600.00 seconds > > > > > > The same structure error is occurring on many (if not all, all that I have > so far checked) nodes across the cluster. From what I have looked at > 8bd07d5c-710f-4072-b351-8ce09d771237 appears to be a common feature, while > the other elements vary. > > > > The file > hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished > exists and is zero bytes. 
The folder > hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 > has two further folders within, part-r-00000 and part-r-00001. Both have > files within called data and index. > > > > The data file in part-r-00000 is 1071KB ends abruptly, thus: > > > > + 2xœc``` > ------------------------------ > f Z > ------------------------------ > " 3 > Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05htsj1.rfxœc``` > > ------------------------------ > > > # 3xœc``` > ------------------------------ > f Z > ------------------------------ > $ 3 > Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05huonw.rfxœc``` > > ------------------------------ > > > % 3xœc``` > ------------------------------ > f Z > ------------------------------ > & 3 Khdfs://master01:9000/u > > > > The index file in part-r-00000 is zero bytes. > > > > The data file in part-r-00001 is 31906KB and ends thus (which looks > reasonable to me): > > > > ------------------------------ > > > xœc```lP22ª3¨+È/vI,ItKM,)-J+54©320´006434.b*MUI5HK1661ÒM³05Õ5ILLÓµ0KNÓM5²HL23H1624e``pc > [1] .D(ÞÊ ´=I_´û#£ƒ _« > ------------------------------ > PŸƒçά > ------------------------------ > @ZÓˆ±÷šÛD ^í)• Ì > > ------------------------------ > > > xœc```jP22ª3¨+È/vI,ItKM,)-J+54©320´006434.b*ÎS17HK1661ÒM³05Õ5±0KÑM2N5ÖMJ³02LI34KM5g``pc > [1] .D(ÞÎÈÀÀ˜¤/ºo#£ > ------------------------------ > ßÔç > @} > ž«š iM#ÆÞkn“ˆs[J±J*Â: s uÉ&ºII&)ºÉf‰ ©‰Ææ–f¦fp· ¡xÄm[1]å @·ñžÛÜ v[— > ------------------------------ > Üm“70Rv > > > > The index file in part-t-00001 is 5KB and equally looks reasonable. > > > > Any help or direction that you might be able to give would be most > gratefully received. > > > > Best regards, > > > > Nick > > > > > This email (and any attachments) may contain confidential information and > is intended solely for the recipient(s) to whom the email is addressed. 