fmiguelez commented on issue #13167: URL: https://github.com/apache/pulsar/issues/13167#issuecomment-1158852504
> @fmiguelez Your bookie replica should be more than 3 but the returned readwrite bookie is only `rev-pulsar-bookie-2`.
>
> Is the bookie pod `rev-pulsar-bookie-0` and `rev-pulsar-bookie-1` still running? You can check their disk free status by https://bookkeeper.apache.org/docs/admin/http#endpoint-apiv1bookieinfo
>
> In the toolset pod, you can try `curl rev-pulsar-bookie-0.rev-pulsar-bookie.mdp-dgt-review-check-data-ocdup2.svc.cluster.local:8000/api/v1/bookie/info`

It was actually a matter of disk space. I reviewed the logs of all the bookies: they reported disk usage beyond the threshold, so they were set to read-only mode.

```
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-0
09:05:58.784 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2447400960, Used space fraction: 0.953402 > threshold 0.95.
09:05:58.826 [SyncThread-7-1] INFO org.apache.bookkeeper.bookie.LedgerDirsManager - No writable ledger dirs below diskUsageThreshold. But Dirs that can accommodate 1073741824 are: [/pulsar/data/bookkeeper/ledgers/current]
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-1
09:05:57.685 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2414260224, Used space fraction: 0.95403296 > threshold 0.95.
09:05:57.734 [SyncThread-7-1] INFO org.apache.bookkeeper.bookie.LedgerDirsManager - No writable ledger dirs below diskUsageThreshold. But Dirs that can accommodate 1073741824 are: [/pulsar/data/bookkeeper/ledgers/current]
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-2
08:44:00.631 [bookie-io-1-1] INFO org.apache.bookkeeper.proto.BookieRequestHandler - Channels disconnected: [id: 0x2b678089, L:/10.72.168.230:3181 ! R:/10.72.168.138:39958]
08:56:59.187 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Disk almost full, suspend major compaction to slow down filling disk.
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-3
09:07:18.867 [LedgerDirsMonitorThread] WARN org.apache.bookkeeper.bookie.LedgerDirsMonitor - LedgerDirsMonitor check process: All ledger directories are non writable
09:07:18.868 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2292150272, Used space fraction: 0.9563579 > threshold 0.95.
```

I modified the bookie configmap to raise the threshold to 0.98:

```
fer@N315D154:~$ kubectl -n mdp-dgt-review-check-data-ocdup2 get configmaps rev-pulsar-bookie -o yaml > rev-pulsar-bookie-cm.yaml
```

I edited `rev-pulsar-bookie-cm.yaml` to add the following property: `diskUsageThreshold: "0.98"`

And then applied the changes:

```
fer@N315D154:~$ kubectl -n mdp-dgt-review-check-data-ocdup2 apply -f rev-pulsar-bookie-cm.yaml
```

For the bookie pods to pick up the change, I manually deleted the bookie replica pods one by one. After that the broker was able to start. However, this did not last long: eventually two of the bookies started to occupy all the space again.
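As a sanity check on the DiskChecker numbers above (a sketch, assuming `Used space fraction` means used/total), the reported `Space left` and fraction imply a total ledger-volume capacity close to the provisioned 50 GiB:

```shell
# Back out the total capacity implied by the bookie-0 log line:
#   Space left: 2447400960 bytes, Used space fraction: 0.953402
# Assumption: fraction = used / total, hence total = left / (1 - fraction).
awk -v left=2447400960 -v frac=0.953402 'BEGIN {
  total = left / (1 - frac)
  printf "total %.1f GiB\n", total / (1024^3)
  printf "used  %.1f GiB\n", (total - left) / (1024^3)
}'
```

So the ~48.9 GiB implied total matches the 50 GiB PV minus filesystem overhead, and with ~46.6 GiB already used, raising the threshold to 0.98 only buys on the order of 1 GiB of extra headroom per bookie.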
I analyzed the stats of all topics, and some of them actually returned an error:

```
root@rev-pulsar-toolset-0:/pulsar/bin# for i in $(./pulsar-admin namespaces list dbus); do topics=$(./pulsar-admin topics list $i); for topic in $topics; do echo $topic; ./pulsar-admin topics stats $topic | grep storageSize; done; done
"persistent://dbus/dgt-measure-point-15min-local/traffic-data-events"
  "storageSize" : 43968,
"persistent://dbus/dgt-measure-point-15min-local/status"
HTTP 500 Server Error
Reason: HTTP 500 Server Error
"persistent://dbus/dgt-time-series-point/traffic-data-events"
  "storageSize" : 2732639642,
"persistent://dbus/dgt-time-series-point/status"
HTTP 500 Server Error
Reason: HTTP 500 Server Error
"persistent://dbus/test-ltp-config/variables"
  "storageSize" : 0,
```

Every bookie replica pod has a 50 GiB volume for ledgers and a 10 GiB volume for journals. The volume being filled up was the ledgers one. As you can see, only one topic has a considerable size (2.5 GiB), which does not make any sense: there are no producers producing any data. My suspicion is that the compaction process (enabled for all topics with a 10 MB threshold) is what is filling up the disk, probably with temporary data copies. This can be a problem, since the biggest topic does not have any repeating keys (compaction makes no sense in that case). I have observed logs like the following:

```
11:19:53.242 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Disk almost full, suspend major compaction to slow down filling disk.
11:19:53.242 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Enter minor compaction, suspendMinor false
11:19:53.242 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Do compaction to compact those files lower than 0.2
```

I have removed my namespace and redeployed two versions of the services:

* One without compaction enabled.
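To put the reported sizes together, the `storageSize` lines can be summed with a quick `awk` filter (a sketch; the stats output is stubbed inline here, but against a live cluster you would pipe `./pulsar-admin topics stats <topic>` through the same filter):

```shell
# Sum the "storageSize" values reported by `pulsar-admin topics stats`.
# The JSON lines are stubbed inline; replace the here-doc with real stats output.
awk -F': ' '/"storageSize"/ { gsub(/,/, "", $2); sum += $2 }
            END { printf "total %.1f GiB\n", sum / (1024^3) }' <<'EOF'
"storageSize" : 43968,
"storageSize" : 2732639642,
"storageSize" : 0,
EOF
```

That sums to roughly 2.5 GiB of live topic data, nowhere near the ~47 GiB actually consumed on each ledger volume, so the space must be going to something other than live topic storage.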
* Another one identical to the one reporting the issue.

On Monday I will check the differences between both environments.
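For reference, an alternative to redeploying without compaction would be to disable threshold-triggered topic compaction per namespace (a sketch, run from the toolset pod and assuming the `dbus/dgt-time-series-point` namespace from the listing above; a threshold of 0 disables automatic compaction, while manually triggered compaction stays available):

```shell
# Disable automatic (threshold-triggered) topic compaction for the namespace.
./pulsar-admin namespaces set-compaction-threshold --threshold 0 dbus/dgt-time-series-point
# Confirm the new value:
./pulsar-admin namespaces get-compaction-threshold dbus/dgt-time-series-point
```

This is a cluster-bound configuration fragment, so it is only a sketch of the approach, not something I have run in this environment.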
