[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287775#comment-14287775 ] ASF GitHub Bot commented on STORM-636: -- GitHub user d2r opened a pull request: https://github.com/apache/storm/pull/392 [STORM-636] Faster, optional retrieval of last component error We want to speed up interactions with the topology that call getTopologyInfo, including the Topology Page in the UI. Before change: - Errors written to /errors/topo-id/comp-name/eN, where N is a sequence number. - getTopologyInfo grabs all /errors/topo-id/*/* errors from ZK After this change: - Errors written to /errors/topo-id/comp-name/eN, where N is a sequence number. (unchanged) - Errors also written to /errors/topo-id/comp-name-last-error (this means there is an extra ZK write per error) - New Nimbus thrift interface method `getTopologyInfoWithOpts` - Only option right now is how many errors to retrieve from ZK: 0, 1, or all. You can merge this pull request into a Git repository by running: $ git pull https://github.com/d2r/storm storm-636-ui-errors Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/392.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #392 commit 1cfa190f2efb06f8798984b43dec801e5ff20ad5 Author: Derek Dagit Date: 2015-01-22T16:46:03Z Faster, optional retrieval of last component error > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287803#comment-14287803 ] ASF GitHub Bot commented on STORM-636: -- Github user d2r commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71059798 For testing this I did the following: I modified the ExclamationTopology as follows: - bolts report a new RuntimeException via reportError on each call to execute - changed topology.max.error.report.per.interval: 100 (default 10), so each executor/thread will write up to 100 errors to ZK every 10 seconds. This is a quick-and-dirty way of increasing write load on ZK for testing. - Replicated the Exclamation bolt an additional ~450 times, so there are many more bolts in the topology. Procedure: - Launch the topology in Inactive state, wait for it to get assigned and stable - go to the topology page and activate the topology (at this point the load time is on the order of seconds) - Refresh the topology until I see fresh errors reported for each bolt - Wait a a minute - Refresh the page and record load times from the Firefox network inspector. - Repeat 4 more times Results: Before patch (median about a minute): - 57s - 1m13s - 55s - 51s - 1m34s With patch (median about 10s) - 12s - 6s - 8s - 13s - 10s > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287845#comment-14287845 ] ASF GitHub Bot commented on STORM-636: -- Github user Parth-Brahmbhatt commented on a diff in the pull request: https://github.com/apache/storm/pull/392#discussion_r23393694 --- Diff: storm-core/src/storm.thrift --- @@ -243,6 +243,16 @@ struct SubmitOptions { 2: optional Credentials creds; } +enum NumErrorsChoice { + ALL, + NONE, + ONE +} + +struct GetInfoOptions { + 1: optional NumErrorsChoice num_err_choice; --- End diff -- Instead of an Enum don't you think a pagination struct with a start and end would be more flexible? > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293882#comment-14293882 ] ASF GitHub Bot commented on STORM-636: -- Github user d2r commented on a diff in the pull request: https://github.com/apache/storm/pull/392#discussion_r23627700 --- Diff: storm-core/src/storm.thrift --- @@ -243,6 +243,16 @@ struct SubmitOptions { 2: optional Credentials creds; } +enum NumErrorsChoice { + ALL, + NONE, + ONE +} + +struct GetInfoOptions { + 1: optional NumErrorsChoice num_err_choice; --- End diff -- I hadn't thought about it that way, but I'm open to it. Could you give an example of what you mean? > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293889#comment-14293889 ] ASF GitHub Bot commented on STORM-636: -- Github user d2r commented on a diff in the pull request: https://github.com/apache/storm/pull/392#discussion_r23628156 --- Diff: storm-core/src/storm.thrift --- @@ -243,6 +243,16 @@ struct SubmitOptions { 2: optional Credentials creds; } +enum NumErrorsChoice { + ALL, + NONE, + ONE +} + +struct GetInfoOptions { + 1: optional NumErrorsChoice num_err_choice; --- End diff -- I should add, the current ZK storage of errors isn't very flexible. We store at most 10 errors per component (hard-coded). With only 10 errors to work with, I am not sure how much more flexibility we get by doing pagination. I guess this hard-coded limit of 10 is why I did not pursue pagination—it is just easier to give all 10 errors in the case that we want more than 1. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293979#comment-14293979 ] ASF GitHub Bot commented on STORM-636: -- Github user Parth-Brahmbhatt commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71706395 I did not know we only stored 10 errors, makes sense to not go with pagination if its hardcoded to 10. I will take a look at the complete PR today. Not to derail the discussion but personally, I would much rather not store errors in zk at all if its just for rendering the errors in UI. If the spouts/bolts could just store this in memory with some expiration that should suffice and we could expose an API at worker layer to get this information directly from it. If the host dies you lose some errors but that does not seem like a big deal. The only downside will be ui would now have to make requests against worker hosts to get erros but that seems ok to me, you would also get parallelism as all these worker calls can be made in parallel. I haven't thought this through completely and its probably much more work but I would love to hear your opinion. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294071#comment-14294071 ] ASF GitHub Bot commented on STORM-636: -- Github user d2r commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71714975 > Not to derail the discussion but personally, I would much rather not store errors in zk at all if its just for rendering the errors in UI. If the spouts/bolts could just store this in memory with some expiration that should suffice and we could expose an API at worker layer to get this information directly from it. If the host dies you lose some errors but that does not seem like a big deal. The only downside will be ui would now have to make requests against worker hosts to get erros but that seems ok to me, you would also get parallelism as all these worker calls can be made in parallel. I haven't thought this through completely and its probably much more work but I would love to hear your opinion. Yeah, we were thinking about distributing things this way too. We figured that the bigger problem is the heartbeats, and if we could get an improvement with less effort here, it would be worth it. It would be a much bigger change to distribute the errors out of ZK, yet maybe it is not a bad idea. (Also, I think it is good to persist the errors anyway, not just in memory. Users would like to see errors on the UI even if there was some issue that brought the supervisor down—like a rolling upgrade of the cluster.) Maybe we could file a JIRA for better gathering of errors. This change was intended to be small in scope and just give a way to get errors more efficiently when a topology has many, many components. It was prompted by seeing topology page load times of minutes from one of our customers. Plus, this may be less of a problem once heartbeats (and their metrics) are no longer getting sent around, but still it may not a bad idea to use a more distributed model like you suggest. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295345#comment-14295345 ] ASF GitHub Bot commented on STORM-636: -- Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71864014 I agree that ZK is not the ideal place to storm most things, but it is by far the most convenient. As such unless it is shown that it is causing a significant load on ZK I would rather leave it there until we can find/build a better place to put it. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295348#comment-14295348 ] ASF GitHub Bot commented on STORM-636: -- Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71864756 The code looks fine to me +1. Although I want to wait for @Parth-Brahmbhatt to finish his review before merging anything in. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295353#comment-14295353 ] ASF GitHub Bot commented on STORM-636: -- Github user d2r commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71865651 Agreed, I would like a review from @Parth-Brahmbhatt first if possible. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295457#comment-14295457 ] ASF GitHub Bot commented on STORM-636: -- Github user Parth-Brahmbhatt commented on the pull request: https://github.com/apache/storm/pull/392#issuecomment-71878844 I am +1 too, the code looks good to me. > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components
[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303754#comment-14303754 ] ASF GitHub Bot commented on STORM-636: -- Github user asfgit closed the pull request at: https://github.com/apache/storm/pull/392 > UI/Monitor is slow for topologies with a large number of components > --- > > Key: STORM-636 > URL: https://issues.apache.org/jira/browse/STORM-636 > Project: Apache Storm > Issue Type: Bug >Affects Versions: 0.10.0 >Reporter: Derek Dagit >Assignee: Derek Dagit >Priority: Minor > > The getTopologyInfo method in nimbus fetches from ZK all errors reported by > all components. This becomes too slow for topologies with a larger numbers > of components (bolts/spouts). > In one example, the UI consistently took over 5 minutes to load the topology > page for a topology with nearly 500 components while ZK was under load. > Errors are currently stored in ZooKeeper under individual znodes per > component. This means that each call to getTopologyInfo needs to list > children of each znode and then download the error znodes it finds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)