Hello,
I’m seeing a lot of errors such as the following across my production cluster,
which has 30 nodes and is running Accumulo 1.7 on Hadoop 2.7.1. The system has
been running for many months without error. I would appreciate any guidance,
particularly on whether I should, or indeed should not, stop the cluster in
order to resolve this.
There are over 60 billion elements in the table associated with this tablet,
and rebuilding from scratch would be very difficult. I can stand some data
loss and re-ingest recent data, if that restores service.
From /usr/local/accumulo/logs/master_master02.log:
2018-03-17 22:02:16,527 [master.Master] ERROR: node11:9997 reports assignment
failed for tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
2018-03-17 22:02:16,560 [master.Master] ERROR: node25:9997 reports assignment
failed for tablet
i7;^A^@^C51008b0d-1fc7-4742-bc4f-67ec280c7ebc^@80000152d;^A^@^C50fb7a13-1f0f-4943-a94f-26fcd8d15439^@8000014d1e
2018-03-17 22:02:16,574 [master.Master] ERROR: node30:9997 reports assignment
failed for tablet
i7;^A^@^C52d07ff3-677b-44b2-bef5-23d027946401^@80000159fc;^A^@^C52cc4173-6e57-4ef0-81a7-879e77a7d820^@80000152
2018-03-17 22:02:16,586 [master.Master] ERROR: node06:9997 reports assignment
failed for tablet
i6;02~1~posDataFeature~gcn~20170228;02~1~posDataFeature~gbv~201608
2018-03-17 22:02:16,589 [master.Master] ERROR: node16:9997 reports assignment
failed for tablet
i6;00~1~posDataFeature~tbr~2016081;00~1~posDataFeature~t9w~201604
2018-03-17 22:02:16,616 [master.Master] ERROR: node26:9997 reports assignment
failed for tablet
i6;02~1~posDataFeature~dxf~201607;02~1~posDataFeature~dvt~201602
2018-03-17 22:02:16,694 [master.Master] ERROR: node17:9997 reports assignment
failed for tablet
i6;17~0~posDataFeature~k42~201505;17~0~posDataFeature~k3u~20160818
2018-03-17 22:02:16,778 [master.Master] ERROR: node07:9997 reports assignment
failed for tablet
i6;10~0~posDataFeature~u33~2017122;10~0~posDataFeature~u1x~20160207
2018-03-17 22:02:16,810 [master.Master] ERROR: node05:9997 reports assignment
failed for tablet
i6;10~1~posDataFeature~rqe~20140805;10~1~posDataFeature~rqc~20160309
2018-03-17 22:02:16,825 [master.Master] ERROR: node18:9997 reports assignment
failed for tablet
i6;06~1~posDataFeature~6pn~2015082;06~1~posDataFeature~6nx~20170514
2018-03-17 22:02:16,827 [master.Master] ERROR: node33:9997 reports assignment
failed for tablet
i6;09~1~posDataFeature~fu4~20170624;09~1~posDataFeature~fgc~20160928
2018-03-17 22:02:16,859 [master.Master] ERROR: node31:9997 reports assignment
failed for tablet
i7;^A^@^Cc5e97d3d-f3d0-4c80-acfb-b4de1d2aaa0e^@8000014a62;^A^@^Cc5e44ab1-a0a1-4d9c-abce-250a27209c15^@80000156fd
2018-03-17 22:02:16,920 [master.Master] ERROR: node14:9997 reports assignment
failed for tablet
i6;22~0~posDataFeature~xqs~2016;22~0~posDataFeature~xn5~2016012
2018-03-17 22:02:16,938 [master.Master] ERROR: node22:9997 reports assignment
failed for tablet
i6;23~1~posDataFeature~kdm~2015013;23~1~posDataFeature~kdh~20160708
2018-03-17 22:02:16,981 [master.Master] ERROR: node15:9997 reports assignment
failed for tablet i6;07~1~posDataFeature~w7e~2016092;07~1~posDataFeature~w49
2018-03-17 22:02:17,194 [master.Master] ERROR: node21:9997 reports assignment
failed for tablet
i7;^A^@^C92dba50b-219b-46d0-932a-a12a58ca830f^@800001523;^A^@^C92da7a96-7011-41d0-82ea-1a39ae7b1a6d^@8000015959
2018-03-17 22:02:17,209 [master.Master] ERROR: node13:9997 reports assignment
failed for tablet
i7;^A^@^Cc05;^A^@^Cc04d6645-9a4b-42b8-a8cc-58ff19e0b957^@800001465
2018-03-17 22:02:17,574 [master.Master] ERROR: node30:9997 reports assignment
failed for tablet i6;17~1~posDataFeature~e7;17~1~posDataFeature~e4r~20160509
2018-03-17 22:02:17,586 [master.Master] ERROR: node06:9997 reports assignment
failed for tablet
i6;01~1~posDataFeature~d3v~20160222;01~1~posDataFeature~d3f~201707072
2018-03-17 22:02:17,590 [master.Master] ERROR: node16:9997 reports assignment
failed for tablet
i7;^A^@^C88fe01d0-243b-487c-ab0d-30b02e8ccf69^@80000152c;^A^@^C88f2eba5-9c1f-4230-8dd2-d1a9f0f659bd^@8000015d
2018-03-17 22:02:17,617 [master.Master] ERROR: node26:9997 reports assignment
failed for tablet i6;05~1~posDataFeature~d1f;05~1~posDataFeature~d0t~201507
2018-03-17 22:02:17,694 [master.Master] ERROR: node17:9997 reports assignment
failed for tablet
i7;^A^@^C3df85419-e448-4fa3-87b3-ec95edae204b^@80000153;^A^@^C3df47e59-05af-43e9-a650-50685f66ec0e^@80000153d
2018-03-17 22:02:17,827 [master.Master] ERROR: node33:9997 reports assignment
failed for tablet
i7;^A^@^C50f11344-0f80-435e-a6fa-7312619e1535^@80000142b8;^A^@^C50eb8126-a75c-4d89-9767-2a6a71c6bfac^@800001552a
2018-03-17 22:02:17,859 [master.Master] ERROR: node31:9997 reports assignment
failed for tablet i6;05~0~posDataFeature~pz8;05~0~posDataFeature~my0~20141
2018-03-17 22:02:17,920 [master.Master] ERROR: node14:9997 reports assignment
failed for tablet
i6;14~1~posDataFeature~7ej~201606;14~1~posDataFeature~7ds~201608
2018-03-17 22:02:17,938 [master.Master] ERROR: node22:9997 reports assignment
failed for tablet
i6;18~1~posDataFeature~u3f~2016031;18~1~posDataFeature~u3b~20150816
2018-03-17 22:02:17,981 [master.Master] ERROR: node15:9997 reports assignment
failed for tablet
i6;14~0~posDataFeature~wtq~20151019;14~0~posDataFeature~wt3~20170605
2018-03-17 22:02:18,194 [master.Master] ERROR: node21:9997 reports assignment
failed for tablet
i6;11~0~posDataFeature~dyq~201607;11~0~posDataFeature~drm~201507
From /usr/local/accumulo/logs/tserver_node11.log:
2018-03-17 22:02:16,516 [tserver.TabletServer] INFO : adding tablet
i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 back to the
assignment pool (retry 129)
2018-03-17 22:02:16,517 [tserver.TabletServer] INFO : node11:9997: got
assignment from master:
i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
2018-03-17 22:02:16,521 [tablet.Tablet] INFO : Starting Write-Ahead Log
recovery for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
2018-03-17 22:02:16,521 [tserver.TabletServer] INFO : Looking for
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished
2018-03-17 22:02:16,521 [log.SortedLogRecovery] INFO : Looking at mutations
from
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237
for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : exception trying to
assign tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404
hdfs://master01:9000/user/accumulo/accumulo/tables/i6/t-011gdek
java.lang.RuntimeException: java.io.IOException: java.io.EOFException
    at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:639)
    at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
    at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2157)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.io.EOFException
    at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:456)
    at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
    at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:589)
    ... 9 more
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
    at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:443)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
    at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
    at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
    at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:454)
    ... 11 more
2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : java.io.IOException:
java.io.EOFException
2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : failed to open tablet
i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 reporting
failure to master
2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : rescheduling tablet load
in 600.00 seconds
An error with the same structure is occurring on many nodes across the cluster
(every node I have checked so far, if not all of them). In everything I have
looked at, the recovery directory 8bd07d5c-710f-4072-b351-8ce09d771237 is the
common feature, while the other elements vary.
The file
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished
exists and is zero bytes. The folder
hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237
has two further folders within, part-r-00000 and part-r-00001. Both have
files within called data and index.
The data file in part-r-00000 is 1071KB and ends abruptly, thus:
+ 2xœc```
________________________________
f Z
________________________________
" 3
Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05htsj1.rfxœc```
________________________________
# 3xœc```
________________________________
f Z
________________________________
$ 3
Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05huonw.rfxœc```
________________________________
% 3xœc```
________________________________
f Z
________________________________
& 3 Khdfs://master01:9000/u
The index file in part-r-00000 is zero bytes.
The data file in part-r-00001 is 31906KB and ends thus (which looks reasonable
to me):
________________________________
xœc```lP22ª3¨+È/vI,ItKM,)-J+54©320´006434.b*MUI5HK1661ÒM³05Õ5ILLÓµ0KNÓM5²HL23H1624e``pc[1].D(ÞÊ´=I_´û#£ƒ
_«
________________________________
PŸƒçά
________________________________
@ZÓˆ±÷šÛD ^í)• Ì
________________________________
xœc```jP22ª3¨+È/vI,ItKM,)-J+54©320´006434.b*ÎS17HK1661ÒM³05Õ5±0KÑM2N5ÖMJ³02LI34KM5g``pc[1].D(ÞÎÈÀÀ˜¤/ºo#£
________________________________
ßÔç
@}
ž«š iM#ÆÞkn“ˆs[J±J*Â:s uÉ&ºII&)ºÉf‰©‰Ææ–f¦fp·¡xÄm[1]å@·ñžÛÜ v[—
________________________________
Üm“70Rv
The index file in part-r-00001 is 5KB and equally looks reasonable.
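In case it helps anyone check other recovery directories for the same symptom,
here is a minimal sketch of how the zero-byte check could be scripted. It is
not something I have run against the cluster. It assumes the usual
eight-column layout of `hdfs dfs -ls -R` output (permissions, replication,
owner, group, size, date, time, path), and the function name and the
EXPECTED_EMPTY set are my own. The `finished` file is treated as a marker
that may legitimately be empty.

```python
# Names that are allowed to be zero bytes; "finished" is assumed to be a marker file.
EXPECTED_EMPTY = {"finished"}


def find_empty_recovery_files(ls_lines):
    """Return paths of unexpected zero-byte files from `hdfs dfs -ls -R` output lines."""
    empty = []
    for line in ls_lines:
        # Assumed columns: perms, replication, owner, group, size, date, time, path
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # skips headers such as "Found 3 items"
        if fields[0].startswith("d"):
            continue  # directories always report size 0
        size = fields[4]
        path = fields[7].strip()
        name = path.rsplit("/", 1)[-1]
        if size == "0" and name not in EXPECTED_EMPTY:
            empty.append(path)
    return empty
```

Given the lines of a recursive listing of the recovery folder, I would expect
this to report the part-r-00000 index file described above and to ignore the
`finished` marker.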
Any help or direction that you might be able to give would be most gratefully
received.
Best regards,
Nick