fmiguelez commented on issue #13167: URL: https://github.com/apache/pulsar/issues/13167#issuecomment-1158852504
> @fmiguelez Your bookie replica should be more than 3 but the returned readwrite bookie is only `rev-pulsar-bookie-2`.
>
> Is the bookie pod `rev-pulsar-bookie-0` and `rev-pulsar-bookie-1` still running? You can check their disk free status by https://bookkeeper.apache.org/docs/admin/http#endpoint-apiv1bookieinfo
>
> In the toolset pod, you can try `curl rev-pulsar-bookie-0.rev-pulsar-bookie.mdp-dgt-review-check-data-ocdup2.svc.cluster.local:8000/api/v1/bookie/info`

It was actually a matter of disk space. I reviewed the logs of all the bookies: they reported disk usage beyond the threshold, so they were set to read-only mode.

```
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-0
09:05:58.784 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2447400960, Used space fraction: 0.953402 > threshold 0.95.
09:05:58.826 [SyncThread-7-1] INFO org.apache.bookkeeper.bookie.LedgerDirsManager - No writable ledger dirs below diskUsageThreshold. But Dirs that can accommodate 1073741824 are: [/pulsar/data/bookkeeper/ledgers/current]
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-1
09:05:57.685 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2414260224, Used space fraction: 0.95403296 > threshold 0.95.
09:05:57.734 [SyncThread-7-1] INFO org.apache.bookkeeper.bookie.LedgerDirsManager - No writable ledger dirs below diskUsageThreshold. But Dirs that can accommodate 1073741824 are: [/pulsar/data/bookkeeper/ledgers/current]
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-2
08:44:00.631 [bookie-io-1-1] INFO org.apache.bookkeeper.proto.BookieRequestHandler - Channels disconnected: [id: 0x2b678089, L:/10.72.168.230:3181 ! R:/10.72.168.138:39958]
08:56:59.187 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Disk almost full, suspend major compaction to slow down filling disk.
fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-3
09:07:18.867 [LedgerDirsMonitorThread] WARN org.apache.bookkeeper.bookie.LedgerDirsMonitor - LedgerDirsMonitor check process: All ledger directories are non writable
09:07:18.868 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2292150272, Used space fraction: 0.9563579 > threshold 0.95.
```

I modified the bookie configmap to raise the threshold to 0.98:

```
fer@N315D154:~$ kubectl -n mdp-dgt-review-check-data-ocdup2 get configmaps rev-pulsar-bookie -o yaml > rev-pulsar-bookie-cm.yaml
```

I edited `rev-pulsar-bookie-cm.yaml` to add the following property: `diskUsageThreshold: "0.98"`

And then applied the changes:

```
fer@N315D154:~$ kubectl -n mdp-dgt-review-check-data-ocdup2 apply -f rev-pulsar-bookie-cm.yaml
```

For the bookie pods to pick up the change, I manually deleted the bookie replica pods one by one. After that the broker was able to start. However, this did not last long: eventually two of the bookies started to occupy all the space again.
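As a sanity check on the DiskChecker numbers above (a sketch, assuming `Used space fraction` means used/total), the reported `Space left` and fraction imply a total ledger-volume capacity close to the provisioned 50 GiB:

```shell
# Back out the total capacity implied by the bookie-0 log line:
#   Space left: 2447400960 bytes, Used space fraction: 0.953402
# Assumption: fraction = used / total, hence total = left / (1 - fraction).
awk -v left=2447400960 -v frac=0.953402 'BEGIN {
  total = left / (1 - frac)
  printf "total %.1f GiB\n", total / (1024^3)
  printf "used  %.1f GiB\n", (total - left) / (1024^3)
}'
```

So the ~48.9 GiB implied total matches the 50 GiB PV minus filesystem overhead, and with ~46.6 GiB already used, raising the threshold to 0.98 only buys on the order of 1 GiB of extra headroom per bookie.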
I analyzed the stats of all topics, and some of them actually returned an error:

```
root@rev-pulsar-toolset-0:/pulsar/bin# for i in $(./pulsar-admin namespaces list dbus); do topics=$(./pulsar-admin topics list $i); for topic in $topics; do echo $topic; ./pulsar-admin topics stats $topic | grep storageSize; done; done
"persistent://dbus/dgt-measure-point-15min-local/traffic-data-events"
  "storageSize" : 43968,
"persistent://dbus/dgt-measure-point-15min-local/status"
HTTP 500 Server Error
Reason: HTTP 500 Server Error
"persistent://dbus/dgt-time-series-point/traffic-data-events"
  "storageSize" : 2732639642,
"persistent://dbus/dgt-time-series-point/status"
HTTP 500 Server Error
Reason: HTTP 500 Server Error
"persistent://dbus/test-ltp-config/variables"
  "storageSize" : 0,
```

Every bookie replica pod has a 50 GiB volume for ledgers and a 10 GiB volume for journals. The volume being filled up was the ledgers one. As you can see, only one topic has a considerable size (2.5 GiB), which does not make any sense: there are no producers producing any data. My suspicion is that the compaction process (enabled for all topics with a 10 MB threshold) is what is filling up the disk, probably with temporary data copies. This can be a problem, since the biggest topic does not have any repeating keys (compaction makes no sense in that case). I have observed logs like the following:

```
11:19:53.242 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Disk almost full, suspend major compaction to slow down filling disk.
11:19:53.242 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Enter minor compaction, suspendMinor false
11:19:53.242 [GarbageCollectorThread-11-1] INFO org.apache.bookkeeper.bookie.GarbageCollectorThread - Do compaction to compact those files lower than 0.2
```

I have removed my namespace and redeployed two versions of the services:

* One without compaction enabled.
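To put the reported sizes together, the `storageSize` lines can be summed with a quick `awk` filter (a sketch; the stats output is stubbed inline here, but against a live cluster you would pipe `./pulsar-admin topics stats <topic>` through the same filter):

```shell
# Sum the "storageSize" values reported by `pulsar-admin topics stats`.
# The JSON lines are stubbed inline; replace the here-doc with real stats output.
awk -F': ' '/"storageSize"/ { gsub(/,/, "", $2); sum += $2 }
            END { printf "total %.1f GiB\n", sum / (1024^3) }' <<'EOF'
"storageSize" : 43968,
"storageSize" : 2732639642,
"storageSize" : 0,
EOF
```

That sums to roughly 2.5 GiB of live topic data, nowhere near the ~47 GiB actually consumed on each ledger volume, so the space must be going to something other than live topic storage.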
* Another one identical to the one reporting the issue.

On Monday I will check the differences between both environments.
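For reference, an alternative to redeploying without compaction would be to disable threshold-triggered topic compaction per namespace (a sketch, run from the toolset pod and assuming the `dbus/dgt-time-series-point` namespace from the listing above; a threshold of 0 disables automatic compaction, while manually triggered compaction stays available):

```shell
# Disable automatic (threshold-triggered) topic compaction for the namespace.
./pulsar-admin namespaces set-compaction-threshold --threshold 0 dbus/dgt-time-series-point
# Confirm the new value:
./pulsar-admin namespaces get-compaction-threshold dbus/dgt-time-series-point
```

This is a cluster-bound configuration fragment, so it is only a sketch of the approach, not something I have run in this environment.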
