Hey Jochen,
tried with Xmx=30Gb

changed the GC algo and i still have big pauses...

2016-08-24T17:06:30.042Z WARN  [GarbageCollectionWarningThread] Last GC run 
with G1 Young Generation took longer than 1 second (last duration=10443 
milliseconds)
2016-08-24T17:06:30.042Z WARN  [jvm] 
[graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][young][65][35] duration 
[10.4s], collections [1]/[10.6s], total [10.4s]/[36.9s], memory 
[29.2gb]->[29.3gb]/[30gb], all_pools {[young] 
[13.1gb]->[8mb]/[0b]}{[survivor] [56mb]->[152mb]/[0b]}{[old] 
[15.9gb]->[29.1gb]/[30gb]}

2016-08-24T17:06:30.391Z WARN  [NodePingThread] Did not find meta info of 
this node. Re-registering.
2016-08-24T17:06:51.065Z WARN  [NodePingThread] Did not find meta info of 
this node. Re-registering.
2016-08-24T17:06:51.065Z WARN  [jvm] 
[graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][66][2] duration 
[18.8s], collections [1]/[20.9s], total [18.8s]/[41.6s], memory 
[29.3gb]->[8.5gb]/[30gb], all_pools {[young] [8mb]->[0b]/[0b]}{[survivor] 
[152mb]->[0b]/[0b]}{[old] [29.1gb]->[8.5gb]/[30gb]}

2016-08-24T17:06:51.276Z WARN  [GarbageCollectionWarningThread] Last GC run 
with G1 Young Generation took longer than 1 second (last duration=1299 
milliseconds)
2016-08-24T17:06:51.284Z WARN  [GarbageCollectionWarningThread] Last GC run 
with G1 Old Generation took longer than 1 second (last duration=18865 
milliseconds)

Will let it run for a hour then ill post the GC logs here


On Wednesday, 24 August 2016 14:12:45 UTC+1, Jochen Schalanda wrote:
>
> Hi Ricardo,
>
> try configuring *less* heap memory for your JVM, ideally less than 32G. 
> See 
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>  
> for details.
>
> Cheers,
> Jochen
>
> On Wednesday, 24 August 2016 15:02:10 UTC+2, Ricardo Ferreira wrote:
>>
>> So, we have a 2 identical node graylog setup:
>>
>>
>>  ProLiant DL380 Gen9 (719061-B21) 
>>  32: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>>  128GB RAM
>>  20MiB L3 cache
>>  2147479552 bytes (2.1 GB) copied, 12.8211 s, 167 MB/s
>>
>> 016-08-24T10:46:34.255Z INFO  [CmdLineTool] Running with JVM arguments: 
>> -Xms32g -Xmx32g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC 
>> -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC 
>> -XX:-OmitStackTraceInFastThrow -Xloggc:/tmp/memoryGC.log 
>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
>> -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml 
>> -Djava.library.path=/usr/share/graylog-server/lib/sigar 
>> -Dgraylog2.installation_source=deb
>>
>> different from default configs:
>>
>>
>> elasticsearch_shards = 5
>> elasticsearch_replicas = 1
>>
>> processbuffer_processors = 15
>> outputbuffer_processors = 12
>> ring_size = 4194304
>> inputbuffer_ring_size = 4194304
>> inputbuffer_processors = 6
>> inputbuffer_wait_strategy = blocking
>> message_journal_max_size = 30gb
>> async_eventbus_processors = 4
>>
>> So our environment does ~ 30-50GB per hour in this testing phase we are 
>> at ~5kmsg/s
>> If you look at the graphs we started doing a run just into one done and 
>> was changed to another node around 11:50UTC
>>
>> While it was running i noticed that the graylog node was de registering 
>> himself:
>>
>> 2016-08-24T10:46:57.254Z WARN  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][8][4] duration 
>> [3.1s], collections [1]/[3.9s], total [3.1s]/[3.8s], memory 
>> [9gb]->[2gb]/[30.4gb], all_pools {[young] 
>> [7.8gb]->[445.1mb]/[12.8gb]}{[survivor] [1.2gb]->[886.4mb]/[1.5gb]}{[old] 
>> [0b]->[770.9mb]/[16gb]}
>> 2016-08-24T10:46:57.374Z INFO  [connection] Opened connection 
>> [connectionId{localValue:22, serverValue:790}] to 172.24.9.59:27017
>> 2016-08-24T10:46:57.402Z INFO  [connection] Opened connection 
>> [connectionId{localValue:21, serverValue:789}] to 172.24.9.59:27017
>> 2016-08-24T10:46:57.416Z INFO  [connection] Opened connection 
>> [connectionId{localValue:23, serverValue:791}] to 172.24.9.59:27017
>> 2016-08-24T10:47:03.900Z INFO  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][13][9] duration 
>> [1.5s], collections [2]/[2.4s], total [1.5s]/[6s], memory 
>> [7.9gb]->[4.2gb]/[30.4gb], all_pools {[young] 
>> [5.2gb]->[1.5gb]/[12.8gb]}{[survivor] [1.4gb]->[599.3mb]/[1.5gb]}{[old] 
>> [1.1gb]->[2.9gb]/[16gb]}
>> 2016-08-24T10:47:08.949Z INFO  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][17][12] duration 
>> [1.5s], collections [2]/[2s], total [1.5s]/[7.7s], memory 
>> [13.1gb]->[5.8gb]/[30.4gb], all_pools {[young] 
>> [8.9gb]->[910.8mb]/[12.8gb]}{[survivor] [1.2gb]->[102.5mb]/[1.5gb]}{[old] 
>> [2.9gb]->[4.8gb]/[16gb]}
>> 2016-08-24T10:47:11.874Z INFO  [connection] Opened connection 
>> [connectionId{localValue:28, serverValue:794}] to 172.24.9.59:27017
>> 2016-08-24T10:47:11.870Z INFO  [connection] Opened connection 
>> [connectionId{localValue:29, serverValue:797}] to 172.24.9.59:27017
>> 2016-08-24T10:47:11.878Z INFO  [connection] Opened connection 
>> [connectionId{localValue:24, serverValue:793}] to 172.24.9.59:27017
>> 2016-08-24T10:47:11.878Z INFO  [connection] Opened connection 
>> [connectionId{localValue:27, serverValue:796}] to 172.24.9.59:27017
>> 2016-08-24T10:47:11.878Z INFO  [connection] Opened connection 
>> [connectionId{localValue:25, serverValue:792}] to 172.24.9.59:27017
>> 2016-08-24T10:47:11.876Z INFO  [connection] Opened connection 
>> [connectionId{localValue:26, serverValue:795}] to 172.24.9.59:27017
>> 2016-08-24T10:47:20.046Z WARN  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][25][19] duration 
>> [2s], collections [2]/[2.4s], total [2s]/[12.5s], memory 
>> [18.8gb]->[11.3gb]/[30.4gb], all_pools {[young] 
>> [9.6gb]->[360.1mb]/[12.8gb]}{[survivor] [1.5gb]->[813.1mb]/[1.5gb]}{[old] 
>> [7.6gb]->[10.2gb]/[16gb]}
>> 2016-08-24T10:47:25.065Z WARN  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][29][21]* 
>> duration [1.3s]*, collections [1]/[2s], total [1.3s]/[14.1s], memory 
>> [20.3gb]->[13.5gb]/[30.4gb], all_pools {[young] 
>> [8.4gb]->[161.9mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] 
>> [10.2gb]->[11.8gb]/[16gb]}
>>
>>
>>
>>
>>
>>
>> *2016-08-24T10:48:02.521Z WARN  [NodePingThread] Did not find meta info 
>> of this node. Re-registering.nevertheless i had other runs when this 
>> problem manifests it self:*
>> 2016-08-22T14:59:48.034Z INFO  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][62][18] duration 
>> [797ms], collections [1]/[1.5s], total [797ms]/[18.1s], memory [
>> 24.7gb]->[15gb]/[30.4gb], all_pools {[young] 
>> [10.8gb]->[47.8mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] 
>> [12.3gb]->[13.3gb]/[16gb]}
>>
>> 2016-08-22T15:03:51.784Z WARN  [KafkaJournal] Committing journal offset 
>> spins 10 times now, this might be a bug. Continuing to try update.
>>
>> *2016-08-22T15:07:08.647Z WARN  [jvm] 
>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][old][477][7] duration 
>> [23.9s], collections [1]/[24.7s], total [23.9s]/[25.8s], mem*
>> *ory [25.2gb]->[11.8gb]/[30.4gb], all_pools {[young] 
>> [10.2gb]->[32.4mb]/[12.8gb]}{[survivor] [411.3mb]->[0b]/[1.5gb]}{[old] 
>> [14.6gb]->[11.8gb]/[16gb]}*
>>
>> 2016-08-22T21:32:36.887Z WARN  [NodePingThread] Did not find meta info of 
>> this node. Re-registering.
>> 2016-08-22T21:33:15.752Z WARN  [NodePingThread] Did not find meta info of 
>> this node. Re-registering.
>>
>> *2016-08-22T21:33:15.752Z WARN  [jvm] 
>> [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][501][37] duration 
>> [38.6s], collections [2]/[38.8s], total [38.6s]/[15.8m], memory 
>> [30.3gb]->[22.8gb]/[30.4gb], all_pools {[young] 
>> [12.8gb]->[6.8gb]/[12.8gb]}{[survivor] [1.5gb]->[0b]/[1.5gb]}{[old] 
>> [15.9gb]->[15.9gb]/[16gb]}*
>>
>> 2016-08-22T21:33:39.089Z WARN  [NodePingThread] Did not find meta info of 
>> this node. Re-registering.
>> 2016-08-22T21:34:27.644Z WARN  [NodePingThread] Did not find meta info of 
>> this node. Re-registering.
>>
>>
>>
>> So I tried to gather some GClogs also attached to the post, if i do a 
>> jstat -gccause on one of the servers
>> this is what i get:
>> root@graylog1-cab1:~# jstat -gccause 2626 1000
>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    
>> FGCT     GCT    LGCC                 GCC
>>   0.00   7.66  85.47  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.50  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.54  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.55  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.55  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.61  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.65  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.68  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.69  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.72  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.75  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.75  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.84  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.87  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.93  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.93  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  85.97  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  86.00  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  86.00  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  86.07  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  86.11  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>>   0.00   7.66  86.17  89.72  98.31      -    132   27.511     7   
>> 26.812   54.323 Allocation Failure   No GC
>> ^Croot@graylog1-cab1:~#
>>
>> root@graylog1-cab2:~# jstat -gccause 15247 1000
>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    
>> FGCT     GCT    LGCC                 GCC
>>  16.00   0.00   2.34  82.67  97.76      -   3956  209.064    42    8.939  
>> 218.002 Allocation Failure   No GC
>>  16.00   0.00  49.51  82.67  97.76      -   3956  209.064    42    8.939  
>> 218.002 Allocation Failure   No GC
>>  16.00   0.00 100.00  82.67  97.76      -   3957  209.064    42    8.939  
>> 218.002 Allocation Failure   Allocation Failure
>>   0.00  16.63  46.85  82.83  97.76      -   3957  209.175    42    8.939  
>> 218.113 Allocation Failure   No GC
>>  18.22  18.59  99.98  83.17  97.76      -   3958  209.175    42    8.939  
>> 218.113 No GC                GCLocker Initiated GC
>>  18.22   0.00  52.19  83.01  97.76      -   3958  209.285    42    8.939  
>> 218.223 GCLocker Initiated GC No GC
>>   0.00  16.02   2.33  83.19  97.76      -   3959  209.324    42    8.939  
>> 218.263 Allocation Failure   No GC
>>   0.00  16.02  50.38  83.19  97.76      -   3959  209.324    42    8.939  
>> 218.263 Allocation Failure   No GC
>>   0.00  16.02  97.35  83.19  97.76      -   3959  209.324    42    8.939  
>> 218.263 Allocation Failure   No GC
>>  16.15   0.00  46.86  83.36  97.76      -   3960  209.363    42    8.939  
>> 218.302 Allocation Failure   No GC
>>  18.10  15.06  99.80  83.56  97.76      -   3961  209.363    42    8.939  
>> 218.302 No GC                GCLocker Initiated GC
>>   0.00  18.95  50.47  83.54  97.76      -   3961  209.462    42    8.939  
>> 218.401 GCLocker Initiated GC No GC
>>  14.51   0.00   2.39  83.72  97.76      -   3962  209.503    42    8.939  
>> 218.441 Allocation Failure   No GC
>>  14.51   0.00  56.86  83.72  97.76      -   3962  209.503    42    8.939  
>> 218.441 Allocation Failure   No GC
>>   0.00  16.35   9.47  83.90  97.76      -   3963  209.542    42    8.939  
>> 218.480 Allocation Failure   No GC
>>   0.00  16.35  63.29  83.90  97.76      -   3963  209.542    42    8.939  
>> 218.480 Allocation Failure   No GC
>>  15.13   0.00  14.54  84.10  97.76      -   3964  209.582    42    8.939  
>> 218.521 Allocation Failure   No GC
>>  15.13   0.00  66.56  84.10  97.76      -   3964  209.582    42    8.939  
>> 218.521 Allocation Failure   No GC
>>   0.00  16.20  18.57  84.28  97.76      -   3965  209.622    42    8.939  
>> 218.561 Allocation Failure   No GC
>>   0.00  16.20  72.64  84.28  97.76      -   3965  209.622    42    8.939  
>> 218.561 Allocation Failure   No GC
>>  16.43   0.00  23.21  84.48  97.76      -   3966  209.660    42    8.939  
>> 218.599 Allocation Failure   No GC
>>  16.43   0.00  72.23  84.48  97.76      -   3966  209.660    42    8.939  
>> 218.599 Allocation Failure   No GC
>>   0.00  13.79  24.92  84.65  97.76      -   3967  209.705    42    8.939  
>> 218.643 Allocation Failure   No GC
>>   0.00  13.79  79.81  84.65  97.76      -   3967  209.705    42    8.939  
>> 218.643 Allocation Failure   No GC
>>  16.05   0.00  32.03  84.82  97.76      -   3968  209.742    42    8.939  
>> 218.681 Allocation Failure   No GC
>>  16.05   0.00  86.74  84.82  97.76      -   3968  209.742    42    8.939  
>> 218.681 Allocation Failure   No GC
>>   0.00  15.02  32.71  84.98  97.76      -   3969  209.782    42    8.939  
>> 218.720 Allocation Failure   No GC
>> ^Croot@graylog1-cab2:~#
>>
>>
>> After seeing this error, I was thinking that probably the YOUNG ratio in 
>> graylog was non optimal, but it looks like it is using a 1:1 ratio so it 
>> has 16GB.
>>
>> At this point, I don't know what I could test, any ideas are welcome.
>>
>> Thanks
>> Ricardo
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Graylog Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to graylog2+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/graylog2/89104ce3-2099-4eb1-abf8-03b626ce1d23%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to