keith-turner commented on issue #5712:
URL: https://github.com/apache/accumulo/issues/5712#issuecomment-3050635696
> I also ran the bulk random walk test with Accumulo version 2.1.4rc0 and
Zookeeper version 3.9.2, and saw zookeeper's CPU usage fairly high across the
whole of the test.
That was helpful, made me realize it was a more general problem than just
after things got stuck.
Looked into this some more and enabled zookeeper server trace logging while
running the test. Looking at the ZK trace logs, saw that getChildren was the
most frequent operation.
```
$ grep reqpath zookeeper_trace.log | grep -o -E "type:[a-zA-Z]+" | sort |
uniq -c
10605 type:create
8429 type:delete
59636 type:exists
11450430 type:getChildren
1715452 type:getData
390 type:ping
3736 type:setData
36 type:setWatches
3512 type:sync
13235432 type:unknown
```
Then based on that found the following was the most frequently accessed path
for the call to getChildren.
```
$ grep getChildren zookeeper_trace.log | grep -o -E "reqpath:[^ ]+" | sort
| uniq -c | sort -r -n -k 1 | head
11201694 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate
134968 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/table_locks/2
13664
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/table_locks/+default
3748
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_6911b96289b3d855
3688
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_0b06908ef4e33636
3616
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_6b629f2b27214a8e
3188
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_46d92e738ee34ef3
2872
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_5e3af9674c1c87a8
2392
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_20b3516eb7a7c1d7
1920
reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_776533e52e653bca
```
That led to opening #5728. I had run into problems in the manager w/ that
code a year or two ago, did not realize it was placing such load on zookeeper
in addition to the manager. Makes sense though. Seeing much lower CPU usage
on the zookeeper server when running bulk randomwalk test w/ #5728. Did not
enable trace logging again and look at the counts for get children though.
Also found that zookeeper has audit logging, tried turning that on before
the trace logging. Its useful, but only shows writes and in this case it was
excessive reads that were causing the problem.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]