Hi devs, I'd like to ask your help in figuring out what is happening to a BatchWriter. The following gives my reasoning so far.
In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status in its addMutation method. I saw that it is stuck by jstack'ing the Accumulo client. It's been stuck like this for 16 hours. The BatchWriter is supposed to wait when a mutation is added if no failures have recorded and either (a) the total memory used exceeds the maximum allowed for the BatchWriter, or (b) the batchwriter is currently flushed. So we conclude that one of (a) or (b) have occurred and no failures were recorded, at the time when addMutation was called. I think (a) is likely. The BatchWriter is supposed to notify itself when either (1) a flush finishes, (2) a constraint violation or authorization failure or server error or unknown error occurs, (3) memory usage decreases, which happens when entries successfully send to the tablet server. Since the BatchWriter is stuck on WAITING, none of these conditions are occurring. The BatchWriter has 3 write threads (the default number). All three have status TIMED_WAITING (parked) in jstack. Their stack traces don't give useful information. Here's what I can tell from the tserver logs. A new table (and tablet) was created successfully. The BatchWriter started writing to this tablet steadily. The logs show that the tablet (vb<<) flushed every 5 seconds or so and major compacted at a steady periodic rate. Everything looks good, until vb<< grew large enough that it needed splitting. This occurred about 42 minutes after the BatchWriter started writing entries. The logs show a failure in an UpdateSession that popped up in the middle of the split operation. This failure continues to show for the next 15 hours. I copied the portion of the tserver logs that look relevant to the split below. I highlighted the line reporting the first failure. It occurs in between when the split starts and when it finishes. Any idea what could have caused this? I don't know if the failure is related to the BatchWriter being stuck in WAITING. It seems likely. I think it is weird that the 3 write threads are all idle; at least one of them should be doing something if the thread calling addMutation() is waiting. Here is a pastebin of the jstack <http://pastebin.com/TLfdrDwX>, though I think I wrote the useful parts from it. 2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<< closed 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split vb;2436< [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs: //localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf, hdfs://localhost:90 00/accumulo/tables/vb/default_tablet/M0000a9g.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf] 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for high split vb<;2436 [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs ://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf, hdfs://localhost:9 000/accumulo/tables/vb/default_tablet/M0000a9g.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf, hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf] 2016-04-17 22:38:06,440 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:42302 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1] 2016-04-17 22:38:06,472 [tablet.Tablet] TABLET_HIST: vb<< split vb;2436< vb<;2436 2016-04-17 22:38:06,472 [tablet.Tablet] DEBUG: offline split time : 3.22 secs *2016-04-17 22:38:06,472 [tserver.TabletServer] INFO : Starting split: vb<<* 2016-04-17 22:38:06,472 [tserver.TabletServer] DEBUG: UpSess 127.0.0.1:41090 0 in 3.059s, at=[0 0 0.00 1] ft=3.010s(pt=3.010s lt=0.000s ct=0.000s) 2016-04-17 22:38:06,472 [tserver.NativeMap] DEBUG: Allocated native map 0x0000000009a20e60 *2016-04-17 22:38:06,473 [tserver.TabletServer] DEBUG: Failures: 1, first extent vb<< successful commits: 0* 2016-04-17 22:38:06,473 [tablet.Tablet] DEBUG: Reloading constraints for extent: vb;2436< 2016-04-17 22:38:06,473 [constraints.ConstraintChecker] DEBUG: Loaded constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint for vb 2016-04-17 22:38:06,474 [tablet.Tablet] TABLET_HIST: vb;2436< opened 2016-04-17 22:38:06,474 [tserver.NativeMap] DEBUG: Allocated native map 0x0000000003edf090 2016-04-17 22:38:06,474 [tablet.Tablet] DEBUG: Reloading constraints for extent: vb<;2436 2016-04-17 22:38:06,474 [constraints.ConstraintChecker] DEBUG: Loaded constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint for vb 2016-04-17 22:38:06,475 [tablet.Tablet] TABLET_HIST: vb<;2436 opened *2016-04-17 22:38:06,475 [tserver.TabletServer] INFO : Tablet split: vb<< size0 539900144 size1 539900148 time 8268ms* 2016-04-17 22:38:06,594 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1) 2016-04-17 22:38:06,597 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1) 2016-04-17 22:38:07,568 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:42307 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 2016-04-17 22:38:07,571 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/c7998e9d-129d-4c61-9732-45bb6bcab3f0/bulk_failed_copyq 2016-04-17 22:38:07,940 [tserver.TabletServer] DEBUG: UpSess 127.0.0.1:41090 0 in 0.067s, at=[1 1 1.00 1] ft=0.000s(pt=0.000s lt=0.000s ct=0.000s) 2016-04-17 22:38:07,941 [tserver.TabletServer] DEBUG: Failures: 1, first extent vb<< successful commits: 0
