We have a 4-node ES cluster (2 client-only nodes, and 2 data/master nodes with 25 GB of memory allocated to ES and 12 cores each), storing an index with 16 shards, ~200 GB of data, and 1 replica.
Recently, while running scan/scroll requests to dump data along with some faceting requests, the nodes disconnected from each other and we ended up with a split-brain condition. All requests were being run sequentially, not in parallel. From the logs we noticed that for 5-10 minutes only the young GC was running, even though old-gen and total memory usage were over 75% (which I believe is the default CMSInitiatingOccupancyFraction=75). When the old GC finally ran, it took 1.7 minutes, which caused the nodes to disconnect (3 pings of 30 seconds each failed). Below are some of the last traces from the log:

[2014-04-15 22:07:33,526][INFO ][monitor.jvm] [ny1.node2] [gc][young][149283][9270] duration [782ms], collections [1]/[1s], total [782ms]/[8.2m], memory [21.5gb]->[20.8gb]/[24.8gb], all_pools {[young] [1gb]->[147mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.3gb]->[20.5gb]/[23.5gb]}
[2014-04-15 22:07:35,264][INFO ][monitor.jvm] [ny1.node2] [gc][young][149284][9271] duration [743ms], collections [1]/[1.7s], total [743ms]/[8.2m], memory [20.8gb]->[20.8gb]/[24.8gb], all_pools {[young] [147mb]->[10.1mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.5gb]->[20.6gb]/[23.5gb]}
[2014-04-15 22:07:36,814][INFO ][monitor.jvm] [ny1.node2] [gc][young][149285][9272] duration [786ms], collections [1]/[1.5s], total [786ms]/[8.2m], memory [20.8gb]->[20.9gb]/[24.8gb], all_pools {[young] [10.1mb]->[2.8mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.6gb]->[20.8gb]/[23.5gb]}
[2014-04-15 22:07:38,880][INFO ][monitor.jvm] [ny1.node2] [gc][young][149287][9273] duration [835ms], collections [1]/[1s], total [835ms]/[8.2m], memory [21.5gb]->[21.1gb]/[24.8gb], all_pools {[young] [655.9mb]->[1.2mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.7gb]->[20.9gb]/[23.5gb]}
[2014-04-15 22:09:24,215][INFO ][monitor.jvm] [ny1.node2] [gc][young][149290][9274] duration [786ms], collections [1]/[1.7m], total [786ms]/[8.3m], memory [21.7gb]->[2.4gb]/[24.8gb], all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}
[2014-04-15 22:09:24,215][WARN ][monitor.jvm] [ny1.node2] [gc][old][149290][25] duration [1.7m], collections [2]/[1.7m], total [1.7m]/[1.7m], memory [21.7gb]->[2.4gb]/[24.8gb], all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}

CPU usage was pretty low on the machine, but it's confusing why the old GC was held off for so long, and why it took so much time when it finally ran. We currently use Java 7 update 10; would upgrading to the latest update (51) help? Would switching to G1 help? Thanks!
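For what it's worth, here is a sketch of the JVM flags we are thinking of enabling to get more detail on the pauses, and to force CMS to honor the occupancy threshold rather than its own heuristics. This assumes a standard Oracle HotSpot JVM, and that our startup script passes ES_JAVA_OPTS through to the java command (the log path is just an example):

```shell
# Sketch only -- assumes HotSpot JVM and a startup script that honors ES_JAVA_OPTS.

# 1. Verbose GC logging, so we can see exactly when/why CMS cycles start:
export ES_JAVA_OPTS="$ES_JAVA_OPTS \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/elasticsearch/gc.log"

# 2. Make CMS kick in at the configured occupancy instead of its own estimate,
#    which might explain why old GC was delayed past 75%:
export ES_JAVA_OPTS="$ES_JAVA_OPTS \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly"

# 3. Alternative we are considering: G1 instead of CMS (available since Java 7u4),
#    with a pause target well under the 30s ping timeout:
# export ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```

Without -XX:+UseCMSInitiatingOccupancyOnly, CMS only uses the fraction as a hint for its first cycle and then relies on its own statistics, which is one possible explanation for the delayed old-gen collection.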