We have a 2-node Storm cluster in production (v0.9.0.1). The condition we're
noticing is that Storm Complete Latency varies throughout the day from 300ms to
1200ms. However during the same period, the sum of all Bolt Execute Latency
times only add up to about ** 50ms **. The same is true for the sum of all of
the Bolt Process Latency times. See snapshot of this illustrated below
(monospaced font works best), along with the full storm configuration.
We've tried doubling the topology workers and ackers, as well as increasing
tasks/parallelism for all bolts. Lastly we also tried to increase the
topology.executor send/receive/transfer buffer sizes (as included below).
However we're unable to account for most of the topology latency as reported
under Complete Latency. Should the sum of the bolt latencies closely match the
topology's Complete Latency? If so, what tuning parameters could we change to
help achieve this?
Topology stats (Storm UI)
Window Emitted Transferred Complete latency (ms) Acked
Failed
10m 0s 4502820 5912040 1009.565
260600 0
3h 0m 0s 71909220 94401060 731.852
4466280 0
1d 0h 0m 0s 461365780 604827120 486.754
31079420 0
All time 911575300 1194802200 483.995
61884080 0
Bolts (All time)
Id Executors Tasks Emitted Transferred Capacity
Execute latency Executed Process latency (ms)
writerBolt 12 12 0 0 0.479
1.246 283204540 1.244
udsEnhancement 400 400 283242820 566485640 0.872
47.420 283204720 47.397
transformBolt 8 8 283168440 283168440 0.772
1.085 283205380 1.085
realtimePublisher 8 8 0 0 0.533
0.834 283204360 0.831
parserBolt 20 20 283254740 283238820 0.745
8.392 61884380 8.409
Full Configuration Settings:
medio.storm.debug false
medio.storm.events.max_future_allowed_hours 24
medio.storm.events.max_past_allowed_hours 720
medio.storm.max_spout_pending 100
medio.storm.max_task_parallelism 5
medio.storm.message_timeout_seconds 180
medio.storm.num_ackers 4
medio.storm.num_workers 4
medio.storm.topology_name avalanche-core
medio.topology.connect_kafka_logging_bolt false
medio.topology.emit_mock_kafka_requests false
medio.topology.version 2.4.6
nimbus.childopts -Xmx1024m
nimbus.cleanup.inbox.freq.secs 600
nimbus.file.copy.expiration.secs 600
nimbus.host m3web01-sef.prod.msrch
nimbus.inbox.jar.expiration.secs 3600
nimbus.monitor.freq.secs 10
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.secs 30
nimbus.thrift.port 6627
nimbus.topology.validator backtype.storm.nimbus.DefaultTopologyValidator
storm.cluster.mode distributed
storm.id avalanche-core-3-1415647968
storm.local.dir /workplace/storm/local
storm.local.mode.zmq false
storm.messaging.netty.buffer_size 5242880
storm.messaging.netty.client_worker_threads 1
storm.messaging.netty.max_retries 30
storm.messaging.netty.max_wait_ms 1000
storm.messaging.netty.min_wait_ms 100
storm.messaging.netty.server_worker_threads 1
storm.messaging.transport backtype.storm.messaging.zmq
storm.thrift.transport backtype.storm.security.auth.SimpleTransportPlugin
storm.zookeeper.connection.timeout 15000
storm.zookeeper.port 2181
storm.zookeeper.retry.interval 1000
storm.zookeeper.retry.intervalceiling.millis 30000
storm.zookeeper.retry.times 5
storm.zookeeper.root /storm
storm.zookeeper.servers ["zk01-sef" "zk02-sef" "zk03-sef"]
storm.zookeeper.session.timeout 20000
supervisor.childopts -Xmx256m
supervisor.enable true
supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3
supervisor.slots.ports [6700 6701 6702 6703]
supervisor.worker.start.timeout.secs 120
supervisor.worker.timeout.secs 30
task.heartbeat.frequency.secs 3
task.refresh.poll.secs 10
topology.acker.executors 4
topology.builtin.metrics.bucket.size.secs 60
topology.debug false
topology.disruptor.wait.strategy com.lmax.disruptor.BlockingWaitStrategy
topology.enable.message.timeouts true
topology.error.throttle.interval.secs 10
topology.executor.receive.buffer.size 16384
topology.executor.send.buffer.size 16384
topology.fall.back.on.java.serialization false
topology.kryo.decorators []
topology.kryo.factory backtype.storm.serialization.DefaultKryoFactory
topology.kryo.register
{"com.medio.services.avalanche.model.impl.RoutableEventImpl"
"com.medio.services.avalanche.storm.common.serializers.RoutableEventSerializer"}
topology.max.error.report.per.interval 5
topology.max.spout.pending 100
topology.max.task.parallelism
topology.message.timeout.secs 180
topology.metrics.consumer.register [{"argument" nil, "parallelism.hint" 2,
"class" "com.medio.services.tempest.client.storm.OpenTSDBMetricConsumer"}]
topology.name avalanche-core
topology.optimize true
topology.receiver.buffer.size 8
topology.skip.missing.kryo.registrations false
topology.sleep.spout.wait.strategy.time.ms 1
topology.spout.wait.strategy backtype.storm.spout.SleepSpoutWaitStrategy
topology.state.synchronization.timeout.secs 60
topology.stats.sample.rate 0.05
topology.tasks
topology.tick.tuple.freq.secs
topology.transfer.buffer.size 32
topology.trident.batch.emit.interval.millis 500
topology.tuple.serializer
backtype.storm.serialization.types.ListDelegateSerializer
topology.worker.childopts
topology.worker.shared.thread.pool.size 4
topology.workers 4
medio.bolt.event_parser.num_tasks 20
medio.bolt.event_parser.parallelism_hint 20
medio.bolt.event_transform.enabled true
medio.bolt.event_transform.num_tasks 8
medio.bolt.event_transform.parallelism_hint 8
medio.bolt.event_transform.refresh_interval 60000
medio.bolt.event_transform.root_directory /usr/medio/avalanche-storm-transforms/
medio.bolt.event_writer.num_tasks 12
medio.bolt.event_writer.parallelism_hint 12
medio.bolt.event_writer.upload_notification.batch_expire_ms 5000
medio.bolt.event_writer.upload_notification.batch_size 200
medio.bolt.event_writer.upload_notification.compression_codec 0
medio.bolt.event_writer.upload_notification.max_queue_size 5000
medio.bolt.event_writer.upload_notification.topic avalancheFileUploads
medio.bolt.event_writer.upload_notification.zk_connect
medio.bolt.kafka_logging.num_tasks 1
medio.bolt.kafka_logging.parallelism_hint 1
medio.bolt.location.blacklisted_events
medio.bolt.location.enabled false
medio.bolt.location.magellan.host 10.10.32.249:8904
medio.bolt.location.magellan.timeout_ms 5000
medio.bolt.location.num_tasks 20
medio.bolt.location.parallelism_hint 20
medio.bolt.location.whitelisted_events
medio.bolt.realtime_publisher.enabled true
medio.bolt.realtime_publisher.kafka.batch_expire_ms 5000
medio.bolt.realtime_publisher.kafka.batch_size 2000
medio.bolt.realtime_publisher.kafka.compression_codec 1
medio.bolt.realtime_publisher.kafka.max_queue_size 100000
medio.bolt.realtime_publisher.kafka.topic realtimeEvents
medio.bolt.realtime_publisher.kafka.zk_connect
zk01-sef,zk02-sef,zk03-sef:2181/kafka/prod_cluster
medio.bolt.realtime_publisher.num_tasks 8
medio.bolt.realtime_publisher.parallelism_hint 8
medio.bolt.realtime_publisher.rabbit.enabled false
medio.bolt.realtime_publisher.rabbit.exchange_name realtimeEvents
medio.bolt.realtime_publisher.rabbit.host dlc02-sea
medio.bolt.realtime_publisher.whitelisted_events *:*:*
medio.bolt.uds_enhancement.ccp.service.root http://ccp.medio.com/ccp-ws/v1/
medio.bolt.uds_enhancement.ccpclient.connectionTimeout 3000
medio.bolt.uds_enhancement.enabled true
medio.bolt.uds_enhancement.insert_medio_id.default_uds_mapping anon_id->anon-id
: user_id -> customer-id : device_id -> device-id : MSISDN -> msisdn
medio.bolt.uds_enhancement.num_tasks 400
medio.bolt.uds_enhancement.parallelism_hint 400
medio.bolt.uds_enhancement.uss.service.root http://sefuss:8422/uss-ws/v1/
medio.bolt.uds_enhancement.whitelisted_events
*:*:medio.botl.uds_enhancement.ccpclient.root http://ccp.medio.com/ccp-ws/v1/?