Hey Jochen, tried with Xmx=30Gb changed the GC algo and i still have big pauses...
2016-08-24T17:06:30.042Z WARN [GarbageCollectionWarningThread] Last GC run with G1 Young Generation took longer than 1 second (last duration=10443 milliseconds) 2016-08-24T17:06:30.042Z WARN [jvm] [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][young][65][35] duration [10.4s], collections [1]/[10.6s], total [10.4s]/[36.9s], memory [29.2gb]->[29.3gb]/[30gb], all_pools {[young] [13.1gb]->[8mb]/[0b]}{[survivor] [56mb]->[152mb]/[0b]}{[old] [15.9gb]->[29.1gb]/[30gb]} 2016-08-24T17:06:30.391Z WARN [NodePingThread] Did not find meta info of this node. Re-registering. 2016-08-24T17:06:51.065Z WARN [NodePingThread] Did not find meta info of this node. Re-registering. 2016-08-24T17:06:51.065Z WARN [jvm] [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][66][2] duration [18.8s], collections [1]/[20.9s], total [18.8s]/[41.6s], memory [29.3gb]->[8.5gb]/[30gb], all_pools {[young] [8mb]->[0b]/[0b]}{[survivor] [152mb]->[0b]/[0b]}{[old] [29.1gb]->[8.5gb]/[30gb]} 2016-08-24T17:06:51.276Z WARN [GarbageCollectionWarningThread] Last GC run with G1 Young Generation took longer than 1 second (last duration=1299 milliseconds) 2016-08-24T17:06:51.284Z WARN [GarbageCollectionWarningThread] Last GC run with G1 Old Generation took longer than 1 second (last duration=18865 milliseconds) Will let it run for a hour then ill post the GC logs here On Wednesday, 24 August 2016 14:12:45 UTC+1, Jochen Schalanda wrote: > > Hi Ricardo, > > try configuring *less* heap memory for your JVM, ideally less than 32G. > See > https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/ > > for details. > > Cheers, > Jochen > > On Wednesday, 24 August 2016 15:02:10 UTC+2, Ricardo Ferreira wrote: >> >> So, we have a 2 identical node graylog setup: >> >> >> ProLiant DL380 Gen9 (719061-B21) >> 32: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz >> 128GB RAM >> 20MiB L3 cache >> 2147479552 bytes (2.1 GB) copied, 12.8211 s, 167 MB/s >> >> 016-08-24T10:46:34.255Z INFO [CmdLineTool] Running with JVM arguments: >> -Xms32g -Xmx32g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC >> -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC >> -XX:-OmitStackTraceInFastThrow -Xloggc:/tmp/memoryGC.log >> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps >> -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml >> -Djava.library.path=/usr/share/graylog-server/lib/sigar >> -Dgraylog2.installation_source=deb >> >> different from default configs: >> >> >> elasticsearch_shards = 5 >> elasticsearch_replicas = 1 >> >> processbuffer_processors = 15 >> outputbuffer_processors = 12 >> ring_size = 4194304 >> inputbuffer_ring_size = 4194304 >> inputbuffer_processors = 6 >> inputbuffer_wait_strategy = blocking >> message_journal_max_size = 30gb >> async_eventbus_processors = 4 >> >> So our environment does ~ 30-50GB per hour in this testing phase we are >> at ~5kmsg/s >> If you look at the graphs we started doing a run just into one done and >> was changed to another node around 11:50UTC >> >> While it was running i noticed that the graylog node was de registering >> himself: >> >> 2016-08-24T10:46:57.254Z WARN [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][8][4] duration >> [3.1s], collections [1]/[3.9s], total [3.1s]/[3.8s], memory >> [9gb]->[2gb]/[30.4gb], all_pools {[young] >> [7.8gb]->[445.1mb]/[12.8gb]}{[survivor] [1.2gb]->[886.4mb]/[1.5gb]}{[old] >> [0b]->[770.9mb]/[16gb]} >> 2016-08-24T10:46:57.374Z INFO [connection] Opened connection >> [connectionId{localValue:22, serverValue:790}] to 172.24.9.59:27017 >> 2016-08-24T10:46:57.402Z INFO [connection] Opened connection >> [connectionId{localValue:21, serverValue:789}] to 172.24.9.59:27017 >> 2016-08-24T10:46:57.416Z INFO [connection] Opened connection >> [connectionId{localValue:23, serverValue:791}] to 172.24.9.59:27017 >> 2016-08-24T10:47:03.900Z INFO [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][13][9] duration >> [1.5s], collections [2]/[2.4s], total [1.5s]/[6s], memory >> [7.9gb]->[4.2gb]/[30.4gb], all_pools {[young] >> [5.2gb]->[1.5gb]/[12.8gb]}{[survivor] [1.4gb]->[599.3mb]/[1.5gb]}{[old] >> [1.1gb]->[2.9gb]/[16gb]} >> 2016-08-24T10:47:08.949Z INFO [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][17][12] duration >> [1.5s], collections [2]/[2s], total [1.5s]/[7.7s], memory >> [13.1gb]->[5.8gb]/[30.4gb], all_pools {[young] >> [8.9gb]->[910.8mb]/[12.8gb]}{[survivor] [1.2gb]->[102.5mb]/[1.5gb]}{[old] >> [2.9gb]->[4.8gb]/[16gb]} >> 2016-08-24T10:47:11.874Z INFO [connection] Opened connection >> [connectionId{localValue:28, serverValue:794}] to 172.24.9.59:27017 >> 2016-08-24T10:47:11.870Z INFO [connection] Opened connection >> [connectionId{localValue:29, serverValue:797}] to 172.24.9.59:27017 >> 2016-08-24T10:47:11.878Z INFO [connection] Opened connection >> [connectionId{localValue:24, serverValue:793}] to 172.24.9.59:27017 >> 2016-08-24T10:47:11.878Z INFO [connection] Opened connection >> [connectionId{localValue:27, serverValue:796}] to 172.24.9.59:27017 >> 2016-08-24T10:47:11.878Z INFO [connection] Opened connection >> [connectionId{localValue:25, serverValue:792}] to 172.24.9.59:27017 >> 2016-08-24T10:47:11.876Z INFO [connection] Opened connection >> [connectionId{localValue:26, serverValue:795}] to 172.24.9.59:27017 >> 2016-08-24T10:47:20.046Z WARN [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][25][19] duration >> [2s], collections [2]/[2.4s], total [2s]/[12.5s], memory >> [18.8gb]->[11.3gb]/[30.4gb], all_pools {[young] >> [9.6gb]->[360.1mb]/[12.8gb]}{[survivor] [1.5gb]->[813.1mb]/[1.5gb]}{[old] >> [7.6gb]->[10.2gb]/[16gb]} >> 2016-08-24T10:47:25.065Z WARN [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][29][21]* >> duration [1.3s]*, collections [1]/[2s], total [1.3s]/[14.1s], memory >> [20.3gb]->[13.5gb]/[30.4gb], all_pools {[young] >> [8.4gb]->[161.9mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] >> [10.2gb]->[11.8gb]/[16gb]} >> >> >> >> >> >> >> *2016-08-24T10:48:02.521Z WARN [NodePingThread] Did not find meta info >> of this node. Re-registering.nevertheless i had other runs when this >> problem manifests it self:* >> 2016-08-22T14:59:48.034Z INFO [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][62][18] duration >> [797ms], collections [1]/[1.5s], total [797ms]/[18.1s], memory [ >> 24.7gb]->[15gb]/[30.4gb], all_pools {[young] >> [10.8gb]->[47.8mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] >> [12.3gb]->[13.3gb]/[16gb]} >> >> 2016-08-22T15:03:51.784Z WARN [KafkaJournal] Committing journal offset >> spins 10 times now, this might be a bug. Continuing to try update. >> >> *2016-08-22T15:07:08.647Z WARN [jvm] >> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][old][477][7] duration >> [23.9s], collections [1]/[24.7s], total [23.9s]/[25.8s], mem* >> *ory [25.2gb]->[11.8gb]/[30.4gb], all_pools {[young] >> [10.2gb]->[32.4mb]/[12.8gb]}{[survivor] [411.3mb]->[0b]/[1.5gb]}{[old] >> [14.6gb]->[11.8gb]/[16gb]}* >> >> 2016-08-22T21:32:36.887Z WARN [NodePingThread] Did not find meta info of >> this node. Re-registering. >> 2016-08-22T21:33:15.752Z WARN [NodePingThread] Did not find meta info of >> this node. Re-registering. >> >> *2016-08-22T21:33:15.752Z WARN [jvm] >> [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][501][37] duration >> [38.6s], collections [2]/[38.8s], total [38.6s]/[15.8m], memory >> [30.3gb]->[22.8gb]/[30.4gb], all_pools {[young] >> [12.8gb]->[6.8gb]/[12.8gb]}{[survivor] [1.5gb]->[0b]/[1.5gb]}{[old] >> [15.9gb]->[15.9gb]/[16gb]}* >> >> 2016-08-22T21:33:39.089Z WARN [NodePingThread] Did not find meta info of >> this node. Re-registering. >> 2016-08-22T21:34:27.644Z WARN [NodePingThread] Did not find meta info of >> this node. Re-registering. >> >> >> >> So I tried to gather some GClogs also attached to the post, if i do a >> jstat -gccause on one of the servers >> this is what i get: >> root@graylog1-cab1:~# jstat -gccause 2626 1000 >> S0 S1 E O M CCS YGC YGCT FGC >> FGCT GCT LGCC GCC >> 0.00 7.66 85.47 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.50 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.54 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.55 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.55 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.61 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.65 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.68 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.69 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.72 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.75 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.75 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.84 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.87 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.93 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.93 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 85.97 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 86.00 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 86.00 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 86.07 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 86.11 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> 0.00 7.66 86.17 89.72 98.31 - 132 27.511 7 >> 26.812 54.323 Allocation Failure No GC >> ^Croot@graylog1-cab1:~# >> >> root@graylog1-cab2:~# jstat -gccause 15247 1000 >> S0 S1 E O M CCS YGC YGCT FGC >> FGCT GCT LGCC GCC >> 16.00 0.00 2.34 82.67 97.76 - 3956 209.064 42 8.939 >> 218.002 Allocation Failure No GC >> 16.00 0.00 49.51 82.67 97.76 - 3956 209.064 42 8.939 >> 218.002 Allocation Failure No GC >> 16.00 0.00 100.00 82.67 97.76 - 3957 209.064 42 8.939 >> 218.002 Allocation Failure Allocation Failure >> 0.00 16.63 46.85 82.83 97.76 - 3957 209.175 42 8.939 >> 218.113 Allocation Failure No GC >> 18.22 18.59 99.98 83.17 97.76 - 3958 209.175 42 8.939 >> 218.113 No GC GCLocker Initiated GC >> 18.22 0.00 52.19 83.01 97.76 - 3958 209.285 42 8.939 >> 218.223 GCLocker Initiated GC No GC >> 0.00 16.02 2.33 83.19 97.76 - 3959 209.324 42 8.939 >> 218.263 Allocation Failure No GC >> 0.00 16.02 50.38 83.19 97.76 - 3959 209.324 42 8.939 >> 218.263 Allocation Failure No GC >> 0.00 16.02 97.35 83.19 97.76 - 3959 209.324 42 8.939 >> 218.263 Allocation Failure No GC >> 16.15 0.00 46.86 83.36 97.76 - 3960 209.363 42 8.939 >> 218.302 Allocation Failure No GC >> 18.10 15.06 99.80 83.56 97.76 - 3961 209.363 42 8.939 >> 218.302 No GC GCLocker Initiated GC >> 0.00 18.95 50.47 83.54 97.76 - 3961 209.462 42 8.939 >> 218.401 GCLocker Initiated GC No GC >> 14.51 0.00 2.39 83.72 97.76 - 3962 209.503 42 8.939 >> 218.441 Allocation Failure No GC >> 14.51 0.00 56.86 83.72 97.76 - 3962 209.503 42 8.939 >> 218.441 Allocation Failure No GC >> 0.00 16.35 9.47 83.90 97.76 - 3963 209.542 42 8.939 >> 218.480 Allocation Failure No GC >> 0.00 16.35 63.29 83.90 97.76 - 3963 209.542 42 8.939 >> 218.480 Allocation Failure No GC >> 15.13 0.00 14.54 84.10 97.76 - 3964 209.582 42 8.939 >> 218.521 Allocation Failure No GC >> 15.13 0.00 66.56 84.10 97.76 - 3964 209.582 42 8.939 >> 218.521 Allocation Failure No GC >> 0.00 16.20 18.57 84.28 97.76 - 3965 209.622 42 8.939 >> 218.561 Allocation Failure No GC >> 0.00 16.20 72.64 84.28 97.76 - 3965 209.622 42 8.939 >> 218.561 Allocation Failure No GC >> 16.43 0.00 23.21 84.48 97.76 - 3966 209.660 42 8.939 >> 218.599 Allocation Failure No GC >> 16.43 0.00 72.23 84.48 97.76 - 3966 209.660 42 8.939 >> 218.599 Allocation Failure No GC >> 0.00 13.79 24.92 84.65 97.76 - 3967 209.705 42 8.939 >> 218.643 Allocation Failure No GC >> 0.00 13.79 79.81 84.65 97.76 - 3967 209.705 42 8.939 >> 218.643 Allocation Failure No GC >> 16.05 0.00 32.03 84.82 97.76 - 3968 209.742 42 8.939 >> 218.681 Allocation Failure No GC >> 16.05 0.00 86.74 84.82 97.76 - 3968 209.742 42 8.939 >> 218.681 Allocation Failure No GC >> 0.00 15.02 32.71 84.98 97.76 - 3969 209.782 42 8.939 >> 218.720 Allocation Failure No GC >> ^Croot@graylog1-cab2:~# >> >> >> After seeing this error, I was thinking that probably the YOUNG ratio in >> graylog was non optimal, but it looks like it is using a 1:1 ratio so it >> has 16GB. >> >> At this point, I don't know what I could test, any ideas are welcome. >> >> Thanks >> Ricardo >> >> -- You received this message because you are subscribed to the Google Groups "Graylog Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to graylog2+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/graylog2/89104ce3-2099-4eb1-abf8-03b626ce1d23%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.