New question #229112 on Graphite: https://answers.launchpad.net/graphite/+question/229112
Hello!

We have a pretty powerful server (8 cores, 300 GB RAM, fast storage) for Graphite, running one relay and 4 cache instances with consistent hashing; no aggregators are used. Graphite-web with Apache also lives on the same server. We currently receive about 850K metrics/min and 20-30K cache queries/min according to the stats.

Most of the time the server works fine, but lately we have started having a problem. It looks like 1 or 2 cache instances suddenly stop working, and we get drops in the graphs. In the relay log we see:

=========================================
17/05/2013 14:41:44 :: [console] Starting factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)
17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade
17/05/2013 14:41:44 :: [listener] MetricLineReceiver connection with 10.32.232.11:47637 established
17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.
17/05/2013 14:41:44 :: [console] <twisted.internet.tcp.Connector instance at 0x2fafab8> will retry in 5 seconds
17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.
17/05/2013 14:41:44 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
=========================================

repeating continuously.

Cache log file:

=========================================
17/05/2013 14:26:39 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:26:58 :: [console] Starting factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:26:58 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)
17/05/2013 14:27:32 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.
17/05/2013 14:27:32 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.
17/05/2013 14:27:32 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
=========================================

I.e. the cache instance is reconnecting to the relay continuously, but for some reason without success. The error logs are empty. Restarting the cache instance did not help; things return to normal only after restarting the relay, but the problem repeats after 5-10 hours.

Maybe we have performance problems, but the system looks quite idle:

=========================================
top - 15:12:44 up 22:32, 9 users, load average: 6.58, 7.15, 7.01
Tasks: 269 total, 3 running, 266 sleeping, 0 stopped, 0 zombie
Cpu(s): 17.6%us, 0.8%sy, 0.0%ni, 69.6%id, 11.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 297175992k total, 39414336k used, 257761656k free, 263712k buffers
Swap: 2568188k total, 0k used, 2568188k free, 30566700k cached
=========================================

Another server with a similar configuration, but with 24 GB RAM and slower storage, handles 400K metrics/min without any problems...
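To put a number on the reconnect loop seen in the relay log, a small script could count how often a connection to each destination is closed almost immediately after it is made. This is only a sketch: the regular expression assumes the exact line layout of the log excerpts in this question (day/month/year timestamps and the `[clients]` tag), and `count_flaps` and `max_lifetime_seconds` are hypothetical names chosen for illustration, not part of Carbon.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the log excerpts in this question:
# "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade"
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) :: \[clients\] "
    r"CarbonClientProtocol\((?P<dest>[^)]+)\)::"
    r"(?P<event>connectionMade|connectionLost)"
)


def count_flaps(lines, max_lifetime_seconds=1.0):
    """Count, per destination, connections lost within
    max_lifetime_seconds of being made (hypothetical helper)."""
    opened = {}        # destination -> time of the last connectionMade
    flaps = Counter()  # destination -> number of short-lived connections
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore [console], [listener] and factory lines
        ts = datetime.strptime(match.group("ts"), "%d/%m/%Y %H:%M:%S")
        dest = match.group("dest")
        if match.group("event") == "connectionMade":
            opened[dest] = ts
        elif dest in opened:
            lifetime = (ts - opened.pop(dest)).total_seconds()
            if lifetime <= max_lifetime_seconds:
                flaps[dest] += 1
    return flaps


if __name__ == "__main__":
    sample = [
        "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade",
        "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.",
    ]
    for dest, count in count_flaps(sample).items():
        print(dest, count)
```

Passing an open file handle for the relay log instead of `sample` would give a per-destination count, which could show whether only instance `b` is affected or whether other destinations flap as well.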
Configs are below:

carbon.conf
=========================================
[cache]
LOG_DIR = /opt/graphite/log
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 5000
LINE_RECEIVER_INTERFACE = 0.0.0.0
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = True
WHISPER_LOCK_WRITES = True
USE_WHITELIST = True

[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202

[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302

[cache:d]
LINE_RECEIVER_PORT = 2403
PICKLE_RECEIVER_PORT = 2404
CACHE_QUERY_PORT = 7402

[relay]
USER =
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 500000
USE_FLOW_CONTROL = True

[aggregator]
USER =
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
DESTINATIONS = 127.0.0.1:2104:a
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 200000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5
USE_WHITELIST = True

storage-schema.conf
=========================================
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1_min_30_days_15_min_1_year_1hour_5years_24hours_10years]
priority = 100
pattern = .*
retentions = 60:30d,900:1y,3600:5y,90000:10y

blacklist.conf
=========================================
.*5MinuteRate
.*75percentile
.*98percentile
.*99percentile
.*999percentile

--
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.

_______________________________________________
Mailing list: https://launchpad.net/~graphite-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~graphite-dev
More help   : https://help.launchpad.net/ListHelp

