[jira] [Commented] (STORM-2693) Topology submission or kill takes too much time when topologies grow to a few hundred
[ https://issues.apache.org/jira/browse/STORM-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180275#comment-16180275 ]

Jungtaek Lim commented on STORM-2693:
-------------------------------------

[~danny0405] The patch is meant to be compatible with the current behavior. Once we move to metrics V2 (also built-in) completely, we can drop metrics from the heartbeat. Indeed, you're right about the current status, so let's set metrics V2 aside for now and go ahead.

> Topology submission or kill takes too much time when topologies grow to a few hundred
> -------------------------------------------------------------------------------------
>
>                 Key: STORM-2693
>                 URL: https://issues.apache.org/jira/browse/STORM-2693
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 0.9.6, 1.0.2, 1.1.0, 1.0.3
>            Reporter: Yuzhao Chen
>              Labels: pull-request-available
>         Attachments: 2FA30CD8-AF15-4352-992D-A67BD724E7FB.png, D4A30D40-25D5-4ACF-9A96-252EBA9E6EF6.png
>
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> For a Storm cluster with 40 hosts [32 cores/128 GB memory each] and hundreds of topologies, nimbus submission and killing take minutes to finish. For example, on a cluster with 300 topologies, it takes about 8 minutes to submit a topology, which seriously hurts our efficiency.
> So I checked the nimbus code and found two factors that drive nimbus submission/kill time in a scheduling round:
> * reading existing assignments from ZooKeeper for every topology [about 4 seconds on a 300-topology cluster]
> * reading all worker heartbeats and updating nimbus's cache [about 30 seconds on a 300-topology cluster]
> The key point is that Storm currently uses ZooKeeper to collect heartbeats [not RPC], and also keeps the physical plan [assignments] in ZooKeeper, which could be entirely local to nimbus.
> So I think we should make some changes to Storm's heartbeat and assignment management.
> For assignment improvements:
> 1. nimbus will keep the assignments on local disk
> 2. on restart or an HA leader change, nimbus will recover assignments from ZooKeeper to local disk
> 3. nimbus will push each supervisor its assignment through RPC every scheduling round
> 4. supervisors will sync assignments at a fixed interval
> For heartbeat improvements:
> 1. workers will report executor health to their supervisor at a fixed interval
> 2. supervisors will report worker heartbeats to nimbus at a fixed interval
> 3. if a supervisor dies, it will notify nimbus through a runtime hook, or nimbus will detect whether the supervisor is still alive
> 4. the supervisor will decide whether a worker is healthy or invalid, and will tell nimbus which executors of each topology are OK

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
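The heartbeat improvement steps above (workers report to their supervisor; the supervisor batches a single report to nimbus) can be sketched in plain Java. All class and method names below are hypothetical illustrations of the proposal, not actual Storm APIs:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a supervisor collects per-worker executor health
// locally and sends nimbus one aggregated report per interval, instead of
// every worker writing its own heartbeat znode to ZooKeeper.
public class SupervisorHeartbeatSketch {
    // topologyId -> executor ids the supervisor currently believes are healthy
    private final Map<String, List<Integer>> healthyExecutors = new HashMap<>();

    // Step 1: each worker reports its executors' health at a fixed interval.
    public void onWorkerReport(String topologyId, List<Integer> okExecutors) {
        healthyExecutors.put(topologyId, okExecutors);
    }

    // Step 2: the supervisor pushes one batched report to nimbus (via RPC in
    // the proposal; here we simply return the batch that would be sent).
    public Map<String, List<Integer>> buildNimbusReport() {
        return new HashMap<>(healthyExecutors);
    }
}
```

With one report per supervisor per interval, nimbus processes O(supervisors) messages instead of scanning O(workers) ZooKeeper znodes, which is the source of the roughly 30-second heartbeat scan described above.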
[jira] [Commented] (STORM-2755) Sharing data between Bolts
[ https://issues.apache.org/jira/browse/STORM-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179307#comment-16179307 ]

Stig Rohde Døssing commented on STORM-2755:
-------------------------------------------

Hi Oussama,

Yes, this is what stream groupings are for. Please see https://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html under the "Stream Groupings" header. With an appropriate grouping you can make sure that all related data goes to the same bolt instance. I would recommend you start by looking at the fields grouping. If you'd like a code example, the WordCountTopology uses a fields grouping here: https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java#L88

> Sharing data between Bolts
> --------------------------
>
>                 Key: STORM-2755
>                 URL: https://issues.apache.org/jira/browse/STORM-2755
>             Project: Apache Storm
>          Issue Type: Question
>            Reporter: Oussama BEN LAMINE
>
> Hello,
> I am facing a problem and your help would be really appreciated.
> I am using Storm 1.0.1. I prepared a topology that passes data through a preparation bolt and then to a Python prediction bolt.
> In the prediction bolt I collect data in real time, create groups of data, and send them to a prediction function.
> This topology works perfectly on one machine, where that machine can process all the data, but not in a distributed cluster: each machine works separately, and data needed for the same group may be processed by different machines.
> Is there a way for bolts to communicate with each other, so we can group data even if they are on different machines?
> Can you please advise? Thank you.
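To make the suggestion above concrete: a fields grouping routes each tuple by hashing the chosen field's values modulo the number of target tasks, so tuples with equal field values always land on the same bolt instance, regardless of which machine runs it. A minimal self-contained sketch of that idea (not Storm's actual implementation):

```java
import java.util.Arrays;

// Sketch of how a fields grouping decides which task receives a tuple:
// hash the grouping-field values, then take the result modulo the task count.
public class FieldsGroupingSketch {
    public static int chooseTask(Object[] groupingFieldValues, int numTasks) {
        // Math.floorMod keeps the index non-negative even for negative hashes.
        return Math.floorMod(Arrays.hashCode(groupingFieldValues), numTasks);
    }
}
```

Because chooseTask is deterministic, every tuple whose grouping field is, say, "storm" goes to the same task. That is what lets the linked WordCountTopology keep per-word counts in bolt-local state even on a multi-machine cluster.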
[jira] [Updated] (STORM-2757) Links are broken when logviewer https port is used
[ https://issues.apache.org/jira/browse/STORM-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Li updated STORM-2757:
----------------------------
    Description:
Some links are broken when logviewer.https.port is configured. For example, with the configuration:
{code:java}
logviewer.https.port: 9093
logviewer.https.keystore.type: "JKS"
logviewer.https.keystore.path: "/keystore-path"
logviewer.https.keystore.password: "xx"
logviewer.https.key.password: "xx"
{code}
We will get: [^screenshot-1.png]
The logLink still uses the http port, which is not reachable.

  was:
Some links are broken when logviewer.https.port is configured. For example, with the configuration:
{code:java}
logviewer.https.port: 9093
logviewer.https.keystore.type: "JKS"
logviewer.https.keystore.path: "/keystore-path"
logviewer.https.keystore.password: "xx"
logviewer.https.key.password: "xx"
{code}

> Links are broken when logviewer https port is used
> --------------------------------------------------
>
>                 Key: STORM-2757
>                 URL: https://issues.apache.org/jira/browse/STORM-2757
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>         Attachments: screenshot-1.png
[jira] [Updated] (STORM-2757) Links are broken when logviewer https port is used
[ https://issues.apache.org/jira/browse/STORM-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Li updated STORM-2757:
----------------------------
    Attachment: screenshot-1.png

> Links are broken when logviewer https port is used
> --------------------------------------------------
>
>                 Key: STORM-2757
>                 URL: https://issues.apache.org/jira/browse/STORM-2757
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>         Attachments: screenshot-1.png
[jira] [Created] (STORM-2757) Links are broken when logviewer https port is used
Ethan Li created STORM-2757:
----------------------------

             Summary: Links are broken when logviewer https port is used
                 Key: STORM-2757
                 URL: https://issues.apache.org/jira/browse/STORM-2757
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Ethan Li
            Assignee: Ethan Li
         Attachments: screenshot-1.png

Some links are broken when logviewer.https.port is configured. For example, with the configuration:
{code:java}
logviewer.https.port: 9093
logviewer.https.keystore.type: "JKS"
logviewer.https.keystore.path: "/keystore-path"
logviewer.https.keystore.password: "xx"
logviewer.https.key.password: "xx"
{code}
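The fix this report implies is that link construction should honor the https configuration: when logviewer.https.port is set, build links with the https scheme and that port instead of the http one. A hedged sketch of that selection logic (the helper name is hypothetical, not Storm's actual code):

```java
// Sketch: pick the scheme and port for a logviewer link depending on whether
// an https port is configured (null here means https is not enabled).
public class LogLinkSketch {
    public static String buildLogLink(String host, Integer httpsPort,
                                      int httpPort, String file) {
        boolean useHttps = httpsPort != null;
        String scheme = useHttps ? "https" : "http";
        int port = useHttps ? httpsPort : httpPort;
        return scheme + "://" + host + ":" + port + "/log?file=" + file;
    }
}
```

With logviewer.https.port set to 9093 as in the configuration above, a link to a worker log would then point at port 9093 over https rather than the unreachable http port.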
[jira] [Updated] (STORM-2083) Blacklist Scheduler
[ https://issues.apache.org/jira/browse/STORM-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated STORM-2083:
----------------------------------
    Labels: blacklist pull-request-available scheduling  (was: blacklist scheduling)

> Blacklist Scheduler
> -------------------
>
>                 Key: STORM-2083
>                 URL: https://issues.apache.org/jira/browse/STORM-2083
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Howard Lee
>              Labels: blacklist, pull-request-available, scheduling
>          Time Spent: 15h 10m
>  Remaining Estimate: 0h
>
> My company went through a production fault in which a critical switch caused an unstable network for a set of machines, with a packet loss rate of 30%-50%. In such a fault the supervisors and workers on those machines are not outright dead, which would be easy to handle. Instead they stay alive but very unstable, occasionally losing their heartbeat to nimbus. In these circumstances nimbus keeps assigning jobs to these machines, only to soon find them invalid again, resulting in very slow convergence to a stable status.
> To deal with such unstable cases, we intend to implement a blacklist scheduler, which will add the unstable nodes (supervisors, slots) to a blacklist temporarily, and resume them later.
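The "add temporarily, resume later" behavior described above can be sketched as a time-based blacklist: a node is blacklisted after repeated failures and automatically resumed once a tolerance period elapses. This is an illustrative sketch under those assumptions, not the actual scheduler implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: blacklist a node after repeated heartbeat failures, and resume it
// automatically once a tolerance period has passed.
public class BlacklistSketch {
    private final int failureThreshold;
    private final long resumeAfterMillis;
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final Map<String, Long> blacklistedAt = new HashMap<>();

    public BlacklistSketch(int failureThreshold, long resumeAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.resumeAfterMillis = resumeAfterMillis;
    }

    // Called each time a node misses a heartbeat or is found invalid.
    public void recordFailure(String node, long nowMillis) {
        int count = failureCounts.merge(node, 1, Integer::sum);
        if (count >= failureThreshold) {
            blacklistedAt.put(node, nowMillis);
        }
    }

    // The scheduler skips blacklisted nodes when assigning slots.
    public boolean isBlacklisted(String node, long nowMillis) {
        Long since = blacklistedAt.get(node);
        if (since == null) {
            return false;
        }
        if (nowMillis - since >= resumeAfterMillis) {
            // Tolerance elapsed: resume the node and reset its counter.
            blacklistedAt.remove(node);
            failureCounts.remove(node);
            return false;
        }
        return true;
    }
}
```

Keeping flapping nodes out of scheduling for a while is what prevents the assign-fail-reassign loop the description mentions, at the cost of temporarily reduced capacity.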
[jira] [Commented] (STORM-2755) Sharing data between Bolts
[ https://issues.apache.org/jira/browse/STORM-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178710#comment-16178710 ]

Oussama BEN LAMINE commented on STORM-2755:
-------------------------------------------

Good morning, any help please?

> Sharing data between Bolts
> --------------------------
>
>                 Key: STORM-2755
>                 URL: https://issues.apache.org/jira/browse/STORM-2755
>             Project: Apache Storm
>          Issue Type: Question
>            Reporter: Oussama BEN LAMINE
[jira] [Commented] (STORM-2693) Topology submission or kill takes too much time when topologies grow to a few hundred
[ https://issues.apache.org/jira/browse/STORM-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178703#comment-16178703 ]

Yuzhao Chen commented on STORM-2693:
------------------------------------

[~kabhwan] I have read STORM-2153 and didn't find any improvement to the heartbeats part, so I decided to redesign it; that can be done independently of the metrics V2 work.

> Topology submission or kill takes too much time when topologies grow to a few hundred
> -------------------------------------------------------------------------------------
>
>                 Key: STORM-2693
>                 URL: https://issues.apache.org/jira/browse/STORM-2693
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 0.9.6, 1.0.2, 1.1.0, 1.0.3
>            Reporter: Yuzhao Chen
>              Labels: pull-request-available
>         Attachments: 2FA30CD8-AF15-4352-992D-A67BD724E7FB.png, D4A30D40-25D5-4ACF-9A96-252EBA9E6EF6.png
>
>          Time Spent: 5h
>  Remaining Estimate: 0h
[jira] [Updated] (STORM-2153) New Metrics Reporting API
[ https://issues.apache.org/jira/browse/STORM-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated STORM-2153:
----------------------------------
    Labels: pull-request-available  (was: )

> New Metrics Reporting API
> -------------------------
>
>                 Key: STORM-2153
>                 URL: https://issues.apache.org/jira/browse/STORM-2153
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: P. Taylor Goetz
>            Assignee: P. Taylor Goetz
>              Labels: pull-request-available
>          Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> This is a proposal to provide a new metrics reporting API based on [Coda Hale's metrics library | http://metrics.dropwizard.io/3.1.0/] (AKA Dropwizard/Yammer metrics).
> h2. Background
> In a [discussion on the dev@ mailing list | http://mail-archives.apache.org/mod_mbox/storm-dev/201610.mbox/%3ccagx0urh85nfh0pbph11pmc1oof6htycjcxsxgwp2nnofukq...@mail.gmail.com%3e] a number of community and PMC members recommended replacing Storm's metrics system with a new API rather than enhancing the existing one. Some of the objections to the existing metrics API include:
> # Metrics are reported as an untyped Java object, making it very difficult to reason about how to report them (e.g. is it a gauge, a counter, etc.?).
> # It is difficult to determine whether metrics coming into the consumer are pre-aggregated or not.
> # Storm's metrics collection occurs through a specialized bolt, which, in addition to potentially affecting system performance, complicates certain types of aggregation when the parallelism of that bolt is greater than one.
> In the discussion on the developer mailing list there is growing consensus for replacing Storm's metrics API with a new API based on Coda Hale's metrics library. This approach has the following benefits:
> # Coda Hale's metrics library is very stable, performant, well thought out, and widely adopted among open source projects (e.g. Kafka).
> # The metrics library provides many existing metric types: Meters, Gauges, Counters, Histograms, and more.
> # The library has a pluggable "reporter" API for publishing metrics to various systems, with existing implementations for JMX, console, CSV, SLF4J, Graphite, and Ganglia.
> # Reporters are straightforward to implement, and can be reused by any project that uses the metrics library (i.e. they would have broader application outside of Storm).
> As noted earlier, the metrics library supports pluggable reporters for sending metrics data to other systems, and implementing a reporter is fairly straightforward (an example reporter implementation can be found here). For example, if someone develops a reporter based on Coda Hale's metrics, it could be used not only for pushing Storm metrics, but also for any system that uses the metrics library, such as Kafka.
> h2. Scope of Effort
> The effort to implement a new metrics API for Storm can be broken down into the following development areas:
> # Implement the API for Storm's internal worker metrics: latencies, queue sizes, capacity, etc.
> # Implement the API for user-defined, topology-specific metrics (exposed via the {{org.apache.storm.task.TopologyContext}} class).
> # Implement the API for Storm daemons: nimbus, supervisor, etc.
> h2. Relationship to Existing Metrics
> This would be a new API that would not affect the existing metrics API. Upon completion, the old metrics API would presumably be deprecated, but kept in place for backward compatibility.
> Internally, the current metrics API uses Storm bolts as the reporting mechanism. The proposed metrics API would not depend on any of Storm's messaging capabilities and would instead use the [metrics library's built-in reporter mechanism | http://metrics.dropwizard.io/3.1.0/manual/core/#man-core-reporters]. This would allow users to use existing {{Reporter}} implementations which are not Storm-specific, and would simplify the process of collecting metrics. Compared to Storm's {{IMetricCollector}} interface, implementing a reporter for the metrics library is much more straightforward (an example can be found [here | https://github.com/dropwizard/metrics/blob/3.2-development/metrics-core/src/main/java/com/codahale/metrics/ConsoleReporter.java]).
> The new metrics capability would not use or affect the ZooKeeper-based metrics used by Storm UI.
> h2. Relationship to JStorm Metrics
> [TBD]
> h2. Target Branches
> [TBD]
> h2. Performance Implications
> [TBD]
> h2. Metrics Namespaces
> [TBD]
> h2. Metrics Collected
> *Worker*
> || Namespace || Metric Type || Description ||
> *Nimbus*
> || Namespace || Metric Type || Description ||
> *Supervisor*
> || Namespace || Metric Type || Description ||
> h2. User-Defined Metrics
> [TBD]
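The pluggable-reporter design the proposal leans on can be illustrated with a tiny self-contained sketch. The types below are hypothetical stand-ins for the Dropwizard metrics API (its MetricRegistry, Counter, and reporter classes), written without the library so the shape of the idea is visible:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the pluggable-reporter idea: a registry holds typed metrics, and
// any Reporter implementation can periodically snapshot and publish them.
public class ReporterSketch {
    // A counter is one of the typed metrics (vs. today's untyped Object).
    public static class Counter {
        private final AtomicLong value = new AtomicLong();
        public void inc() { value.incrementAndGet(); }
        public long getCount() { return value.get(); }
    }

    public static class Registry {
        private final Map<String, Counter> counters = new TreeMap<>();
        public Counter counter(String name) {
            return counters.computeIfAbsent(name, n -> new Counter());
        }
        public Map<String, Counter> getCounters() { return counters; }
    }

    // Reporters plug in here; console, CSV, or Graphite reporters would
    // differ only in where report() writes the snapshot.
    public interface Reporter {
        void report(Map<String, Counter> counters);
    }
}
```

A reporter written against this kind of interface knows nothing about the system producing the metrics, which is the "broader application outside of Storm" point above: the same reporter works for any project sharing the registry abstraction.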