This was a great question. I want to start recording answers to these types of questions in the troubleshooting documentation[1] for 2.0. I made a pull request[2] to the website repo for this one, if anyone wants to review or comment on it.
[1]: https://accumulo.apache.org/docs/unreleased/troubleshooting/basic
[2]: https://github.com/apache/accumulo-website/pull/18

On Wed, Jul 5, 2017 at 3:32 PM Christopher <[email protected]> wrote:

> Huge GC pauses can be mitigated by ensuring you're using the Accumulo
> native maps library.
>
> On Wed, Jul 5, 2017 at 11:05 AM Cyrille Savelief <[email protected]> wrote:
>
>> Hi Massimilian,
>>
>> Using a MultiTableBatchWriter we are able to ingest about 600K entries/s
>> on a single node (30 GB of memory, 8 vCPUs) running Hadoop, ZooKeeper,
>> Accumulo, and our ingest process. For us, the "valleys" came from huge GC
>> pauses.
>>
>> Best,
>>
>> Cyrille
>>
>> On Wed, Jul 5, 2017 at 2:37 PM, Massimilian Mattetti <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have an Accumulo 1.8.1 cluster made up of 12 bare-metal servers. Each
>>> server has 256 GB of RAM and 2 x 10-core CPUs. Two machines are used as
>>> masters (running the HDFS NameNodes, Accumulo Master, and Monitor). The
>>> other 10 machines have 12 disks of 1 TB each (11 used by the HDFS
>>> DataNode process) and run the Accumulo TServer processes. All the
>>> machines are connected via a 10 Gb network, and 3 of them are running
>>> ZooKeeper. I have run some heavy ingestion tests on this cluster but
>>> have never been able to reach more than 20% CPU usage on any tablet
>>> server. I am running an ingestion process (using batch writers) on each
>>> data node. The table is pre-split so that each tablet server holds 4
>>> tablets. Monitoring the network, I have seen data received/sent on each
>>> node at a peak rate of about 120 MB/s / 100 MB/s, while the aggregate
>>> disk write throughput on each tablet server is around 120 MB/s.
>>>
>>> The table configuration I am playing with is:
>>> "table.file.replication": "2",
>>> "table.compaction.minor.logs.threshold": "10",
>>> "table.durability": "flush",
>>> "table.file.max": "30",
>>> "table.compaction.major.ratio": "9",
>>> "table.split.threshold": "1G"
>>>
>>> while the tablet server configuration is:
>>> "tserver.wal.blocksize": "2G",
>>> "tserver.walog.max.size": "8G",
>>> "tserver.memory.maps.max": "32G",
>>> "tserver.compaction.minor.concurrent.max": "50",
>>> "tserver.compaction.major.concurrent.max": "8",
>>> "tserver.total.mutation.queue.max": "50M",
>>> "tserver.wal.replication": "2",
>>> "tserver.compaction.major.thread.files.open.max": "15"
>>>
>>> The tablet server heap has been set to 32 GB.
>>>
>>> From the Monitor UI:
>>>
>>> [inline ingest-rate graph omitted]
>>>
>>> As you can see, there are a lot of valleys in which the ingestion rate
>>> drops to 0. What would be a good procedure for identifying the
>>> bottleneck that causes these zero-ingestion periods?
>>> Thanks.
>>>
>>> Best Regards,
>>> Max
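
As a footnote to Christopher's native-maps point: one quick way to see what the tablet servers are configured with is to read the system configuration through the client API. The sketch below is only illustrative, written against the 1.8 client API; the instance name, ZooKeeper hosts, and credentials are placeholders. It only shows the configured values; the tserver log at startup should also indicate whether the native map library was actually loaded or whether it fell back to the Java map.

    import java.util.Map;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class CheckNativeMaps {
        public static void main(String[] args) throws Exception {
            // Placeholder instance name, ZooKeeper quorum, and credentials.
            Instance inst = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181,zk3:2181");
            Connector conn = inst.getConnector("root", new PasswordToken("secret"));

            // System configuration as the servers see it (site config plus ZooKeeper overrides).
            Map<String, String> sysConf = conn.instanceOperations().getSystemConfiguration();
            System.out.println("tserver.memory.maps.native.enabled = "
                    + sysConf.get("tserver.memory.maps.native.enabled"));
            System.out.println("tserver.memory.maps.max = "
                    + sysConf.get("tserver.memory.maps.max"));
        }
    }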
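Cyrille's MultiTableBatchWriter suggestion, sketched against the 1.8 client API. The connection details, table name, row/column contents, and the BatchWriterConfig numbers below are placeholders for illustration, not tuning recommendations, and the table is assumed to exist already.

    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.MultiTableBatchWriter;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;

    public class MultiTableIngestSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details.
            Instance inst = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181,zk3:2181");
            Connector conn = inst.getConnector("root", new PasswordToken("secret"));

            // One shared client-side buffer and thread pool for every table this process writes to.
            BatchWriterConfig cfg = new BatchWriterConfig()
                .setMaxMemory(64 * 1024 * 1024)      // client-side mutation buffer (placeholder size)
                .setMaxWriteThreads(8)               // concurrent RPCs to tablet servers
                .setMaxLatency(2, TimeUnit.MINUTES); // flush at least this often

            MultiTableBatchWriter mtbw = conn.createMultiTableBatchWriter(cfg);
            try {
                BatchWriter bw = mtbw.getBatchWriter("ingest_table"); // hypothetical table name
                for (int i = 0; i < 1_000_000; i++) {
                    Mutation m = new Mutation(String.format("row_%08d", i));
                    m.put("cf", "cq", "value_" + i);
                    bw.addMutation(m); // buffered; flushed to tablet servers in the background
                }
            } finally {
                mtbw.close(); // flushes any remaining buffered mutations
            }
        }
    }

The point of the MultiTableBatchWriter is that all tables written by the process share one buffer and write-thread pool, so per-table writers do not each carve out their own memory and threads.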
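Max mentions pre-splitting the table so that each of the 10 tablet servers holds 4 tablets, i.e. roughly 39 split points. A hypothetical sketch using TableOperations.addSplits is below; it assumes row keys spread uniformly over a hex prefix, which is an assumption about the key scheme, not something stated in the thread.

    import java.util.SortedSet;
    import java.util.TreeSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.hadoop.io.Text;

    public class PreSplitSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details and table name.
            Instance inst = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181,zk3:2181");
            Connector conn = inst.getConnector("root", new PasswordToken("secret"));

            String table = "ingest_table";
            if (!conn.tableOperations().exists(table)) {
                conn.tableOperations().create(table);
            }

            // 39 split points -> 40 tablets, i.e. 4 per tablet server on 10 servers.
            // Assumes row keys are distributed uniformly over a 2-byte hex prefix.
            SortedSet<Text> splits = new TreeSet<>();
            int tablets = 40;
            for (int i = 1; i < tablets; i++) {
                splits.add(new Text(String.format("%04x", i * 0x10000 / tablets)));
            }
            conn.tableOperations().addSplits(table, splits);
        }
    }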
