Hi Nick,

It looks like you are on the right track with the recovery files, which are
created when WALs have to be replayed; see
http://accumulo.apache.org/1.8/accumulo_user_manual.html#_recovery.  Maybe
try deleting
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished
and give it 10 minutes or so.  It could be that one or both of those part
files are bad, so your next step could be to remove the
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/
directory entirely.  Again, give Accumulo 10 minutes or more.  I don't
recall how to trace 8bd07d5c-710f-4072-b351-8ce09d771237 back to the WALs;
maybe look for the first occurrence of that ID in the logs to see whether
the WALs are still there.  If they are not, maybe move
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/
aside instead of deleting it.  If you can figure out what is in those WALs,
you will know what to replay.
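
To find where that recovery ID first shows up, something like the following
could be a starting point.  It is only a sketch: it assumes the logs are
plain text files you can read locally (for example the master and tserver
logs under /usr/local/accumulo/logs), and the function names are my own, not
anything that ships with Accumulo.

```python
from typing import Iterable, Optional, Tuple

# Recovery directory name taken from the tserver log in this thread.
RECOVERY_ID = "8bd07d5c-710f-4072-b351-8ce09d771237"


def first_occurrence(lines: Iterable[str], needle: str) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line text) for the first line containing needle."""
    for number, line in enumerate(lines, start=1):
        if needle in line:
            return number, line.rstrip("\n")
    return None


def first_occurrence_in_file(path: str, needle: str = RECOVERY_ID) -> Optional[Tuple[int, str]]:
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return first_occurrence(handle, needle)
```

Run first_occurrence_in_file over the oldest logs you still have; the lines
around the first hit should name the WAL files that were being sorted.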

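If you go the "move instead of delete" route, the steps could be scripted
roughly as below.  This is a sketch, not a tested procedure: the quarantine
path is a made-up location (pick any HDFS path outside Accumulo's recovery
directory), and the script only prints the hdfs dfs commands unless you turn
dry_run off.

```python
import subprocess
from typing import List

# Recovery directory reported in the tserver log in this thread.
RECOVERY_DIR = (
    "hdfs://master01:9000/user/accumulo/accumulo/recovery/"
    "8bd07d5c-710f-4072-b351-8ce09d771237"
)
# Hypothetical holding area; not a path Accumulo knows about.
QUARANTINE_DIR = "hdfs://master01:9000/user/accumulo/recovery-quarantine"


def move_aside_commands(recovery_dir: str, quarantine_dir: str) -> List[List[str]]:
    """Build the hdfs commands: list the directory, create the holding area, move."""
    return [
        ["hdfs", "dfs", "-ls", "-R", recovery_dir],
        ["hdfs", "dfs", "-mkdir", "-p", quarantine_dir],
        ["hdfs", "dfs", "-mv", recovery_dir, quarantine_dir],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run(move_aside_commands(RECOVERY_DIR, QUARANTINE_DIR))
```

Keeping the directory around means you can still inspect the part files
later if the WALs turn out to be gone.
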
Good luck

Mike

On Sat, Mar 17, 2018 at 7:14 PM Nick Wise <nicholas.w...@sa.catapult.org.uk>
wrote:

> Hello,
>
>
>
> I’m seeing a lot of errors such as the following across my production
> cluster, which has 30 nodes and is running Accumulo 1.7 on Hadoop 2.7.1.
> The system has been running for many months without error.  I would
> appreciate any guidance that can be given, particularly on whether I
> should, or indeed should not, stop the cluster in order to resolve this.
>
>
>
> There are over 60 billion elements in the table associated with this
> tablet, and rebuilding from scratch would be very difficult.  I can stand
> some data loss and re-ingest recent data, if that restores service.
>
>
>
> From /usr/local/accumulo/logs/master_master02.log:
>
>
>
> 2018-03-17 22:02:16,527 [master.Master] ERROR: node11:9997 reports
> assignment failed for tablet
> i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
>
> 2018-03-17 22:02:16,560 [master.Master] ERROR: node25:9997 reports
> assignment failed for tablet
> i7;^A^@^C51008b0d-1fc7-4742-bc4f-67ec280c7ebc^@80000152d;^A^@^C50fb7a13-1f0f-4943-a94f-26fcd8d15439^@8000014d1e
>
> 2018-03-17 22:02:16,574 [master.Master] ERROR: node30:9997 reports
> assignment failed for tablet
> i7;^A^@^C52d07ff3-677b-44b2-bef5-23d027946401^@80000159fc;^A^@^C52cc4173-6e57-4ef0-81a7-879e77a7d820^@80000152
>
> 2018-03-17 22:02:16,586 [master.Master] ERROR: node06:9997 reports
> assignment failed for tablet
> i6;02~1~posDataFeature~gcn~20170228;02~1~posDataFeature~gbv~201608
>
> 2018-03-17 22:02:16,589 [master.Master] ERROR: node16:9997 reports
> assignment failed for tablet
> i6;00~1~posDataFeature~tbr~2016081;00~1~posDataFeature~t9w~201604
>
> 2018-03-17 22:02:16,616 [master.Master] ERROR: node26:9997 reports
> assignment failed for tablet
> i6;02~1~posDataFeature~dxf~201607;02~1~posDataFeature~dvt~201602
>
> 2018-03-17 22:02:16,694 [master.Master] ERROR: node17:9997 reports
> assignment failed for tablet
> i6;17~0~posDataFeature~k42~201505;17~0~posDataFeature~k3u~20160818
>
> 2018-03-17 22:02:16,778 [master.Master] ERROR: node07:9997 reports
> assignment failed for tablet
> i6;10~0~posDataFeature~u33~2017122;10~0~posDataFeature~u1x~20160207
>
> 2018-03-17 22:02:16,810 [master.Master] ERROR: node05:9997 reports
> assignment failed for tablet
> i6;10~1~posDataFeature~rqe~20140805;10~1~posDataFeature~rqc~20160309
>
> 2018-03-17 22:02:16,825 [master.Master] ERROR: node18:9997 reports
> assignment failed for tablet
> i6;06~1~posDataFeature~6pn~2015082;06~1~posDataFeature~6nx~20170514
>
> 2018-03-17 22:02:16,827 [master.Master] ERROR: node33:9997 reports
> assignment failed for tablet
> i6;09~1~posDataFeature~fu4~20170624;09~1~posDataFeature~fgc~20160928
>
> 2018-03-17 22:02:16,859 [master.Master] ERROR: node31:9997 reports
> assignment failed for tablet
> i7;^A^@^Cc5e97d3d-f3d0-4c80-acfb-b4de1d2aaa0e^@8000014a62;^A^@^Cc5e44ab1-a0a1-4d9c-abce-250a27209c15^@80000156fd
>
> 2018-03-17 22:02:16,920 [master.Master] ERROR: node14:9997 reports
> assignment failed for tablet
> i6;22~0~posDataFeature~xqs~2016;22~0~posDataFeature~xn5~2016012
>
> 2018-03-17 22:02:16,938 [master.Master] ERROR: node22:9997 reports
> assignment failed for tablet
> i6;23~1~posDataFeature~kdm~2015013;23~1~posDataFeature~kdh~20160708
>
> 2018-03-17 22:02:16,981 [master.Master] ERROR: node15:9997 reports
> assignment failed for tablet
> i6;07~1~posDataFeature~w7e~2016092;07~1~posDataFeature~w49
>
> 2018-03-17 22:02:17,194 [master.Master] ERROR: node21:9997 reports
> assignment failed for tablet
> i7;^A^@^C92dba50b-219b-46d0-932a-a12a58ca830f^@800001523;^A^@^C92da7a96-7011-41d0-82ea-1a39ae7b1a6d^@8000015959
>
> 2018-03-17 22:02:17,209 [master.Master] ERROR: node13:9997 reports
> assignment failed for tablet
> i7;^A^@^Cc05;^A^@^Cc04d6645-9a4b-42b8-a8cc-58ff19e0b957^@800001465
>
> 2018-03-17 22:02:17,574 [master.Master] ERROR: node30:9997 reports
> assignment failed for tablet
> i6;17~1~posDataFeature~e7;17~1~posDataFeature~e4r~20160509
>
> 2018-03-17 22:02:17,586 [master.Master] ERROR: node06:9997 reports
> assignment failed for tablet
> i6;01~1~posDataFeature~d3v~20160222;01~1~posDataFeature~d3f~201707072
>
> 2018-03-17 22:02:17,590 [master.Master] ERROR: node16:9997 reports
> assignment failed for tablet
> i7;^A^@^C88fe01d0-243b-487c-ab0d-30b02e8ccf69^@80000152c;^A^@^C88f2eba5-9c1f-4230-8dd2-d1a9f0f659bd^@8000015d
>
> 2018-03-17 22:02:17,617 [master.Master] ERROR: node26:9997 reports
> assignment failed for tablet
> i6;05~1~posDataFeature~d1f;05~1~posDataFeature~d0t~201507
>
> 2018-03-17 22:02:17,694 [master.Master] ERROR: node17:9997 reports
> assignment failed for tablet
> i7;^A^@^C3df85419-e448-4fa3-87b3-ec95edae204b^@80000153;^A^@^C3df47e59-05af-43e9-a650-50685f66ec0e^@80000153d
>
> 2018-03-17 22:02:17,827 [master.Master] ERROR: node33:9997 reports
> assignment failed for tablet
> i7;^A^@^C50f11344-0f80-435e-a6fa-7312619e1535^@80000142b8;^A^@^C50eb8126-a75c-4d89-9767-2a6a71c6bfac^@800001552a
>
> 2018-03-17 22:02:17,859 [master.Master] ERROR: node31:9997 reports
> assignment failed for tablet
> i6;05~0~posDataFeature~pz8;05~0~posDataFeature~my0~20141
>
> 2018-03-17 22:02:17,920 [master.Master] ERROR: node14:9997 reports
> assignment failed for tablet
> i6;14~1~posDataFeature~7ej~201606;14~1~posDataFeature~7ds~201608
>
> 2018-03-17 22:02:17,938 [master.Master] ERROR: node22:9997 reports
> assignment failed for tablet
> i6;18~1~posDataFeature~u3f~2016031;18~1~posDataFeature~u3b~20150816
>
> 2018-03-17 22:02:17,981 [master.Master] ERROR: node15:9997 reports
> assignment failed for tablet
> i6;14~0~posDataFeature~wtq~20151019;14~0~posDataFeature~wt3~20170605
>
> 2018-03-17 22:02:18,194 [master.Master] ERROR: node21:9997 reports
> assignment failed for tablet
> i6;11~0~posDataFeature~dyq~201607;11~0~posDataFeature~drm~201507
>
>
>
> From /usr/local/accumulo/logs/tserver_node11.log:
>
>
>
> 2018-03-17 22:02:16,516 [tserver.TabletServer] INFO : adding tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 back to the assignment pool (retry 129)
> 2018-03-17 22:02:16,517 [tserver.TabletServer] INFO : node11:9997: got assignment from master: i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
> 2018-03-17 22:02:16,521 [tablet.Tablet] INFO : Starting Write-Ahead Log recovery for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
> 2018-03-17 22:02:16,521 [tserver.TabletServer] INFO : Looking for hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished
> 2018-03-17 22:02:16,521 [log.SortedLogRecovery] INFO : Looking at mutations from hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
> 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : exception trying to assign tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 hdfs://master01:9000/user/accumulo/accumulo/tables/i6/t-011gdek
> java.lang.RuntimeException: java.io.IOException: java.io.EOFException
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:639)
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
>         at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2157)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>         at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: java.io.EOFException
>         at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:456)
>         at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:589)
>         ... 9 more
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
>         at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:443)
>         at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
>         at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
>         at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:454)
>         ... 11 more
> 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : java.io.IOException: java.io.EOFException
> 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : failed to open tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 reporting failure to master
> 2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
>
>
>
>
>
> The same pattern of errors is occurring on many nodes across the cluster
> (every node I have checked so far, if not all of them).  From what I have
> looked at, 8bd07d5c-710f-4072-b351-8ce09d771237 appears to be a common
> feature, while the other elements vary.
>
>
>
> The file
> hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished
> exists and is zero bytes.  The folder
> hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237
> has two further folders within, part-r-00000 and part-r-00001.  Both have
> files within called data and index.
>
>
>
> The data file in part-r-00000 is 1071KB and ends abruptly, thus:
>
>
>
>        +  2xœc```
> ------------------------------
>     f   Z
> ------------------------------
>        "  3
> Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05htsj1.rfxœc```
>
> ------------------------------
>
>
>        #  3xœc```
> ------------------------------
>     f   Z
> ------------------------------
>        $  3
> Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05huonw.rfxœc```
>
> ------------------------------
>
>
>        %  3xœc```
> ------------------------------
>     f   Z
> ------------------------------
>        &  3 Khdfs://master01:9000/u
>
>
>
> The index file in part-r-00000 is zero bytes.
>
>
>
> The data file in part-r-00001 is 31906KB and ends thus (which looks
> reasonable to me):
>
>
>
> ------------------------------
>
>    
> xœc```lP22ª3¨+È/vI,ItKM,)-J­+54©320´006434.b*MUI5HK1661ÒM³05Õ5ILLÓµ0KNÓM5²HL23H1624e``pc
> [1] .D(ÞÊ ´=I_´û#£ƒ _«
> ------------------------------
> PŸƒçά
> ------------------------------
> @ZÓˆ±÷šÛD ^í)•   Ì
>
> ------------------------------
>
>           
> xœc```jP22ª3¨+È/vI,ItKM,)-J­+54©320´006434.b*ÎS17HK1661ÒM³05Õ5±0KÑM2N5ÖMJ³02LI34KM5g``pc
> [1] .D(ÞÎÈÀÀ˜¤/ºo#£
> ------------------------------
> ßÔç
> @}
> ž«š iM#ÆÞkn“ˆs[J±J*Â: s uÉ&ºII&)ºÉf‰ ©‰Ææ–f¦fp· ¡xÄm[1]å @·ñžÛÜ v[—
> ------------------------------
> Üm“70Rv
>
>
>
> The index file in part-r-00001 is 5KB and equally looks reasonable.
>
>
>
> Any help or direction that you might be able to give would be most
> gratefully received.
>
>
>
> Best regards,
>
>
>
> Nick
>
