[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287775#comment-14287775
 ] 

ASF GitHub Bot commented on STORM-636:
--

GitHub user d2r opened a pull request:

https://github.com/apache/storm/pull/392

[STORM-636] Faster, optional retrieval of last component error

We want to speed up interactions with the topology that call 
getTopologyInfo, including the Topology Page in the UI.

Before change:
- Errors written to /errors/topo-id/comp-name/eN, where N is a sequence 
number.
- getTopologyInfo grabs all /errors/topo-id/*/* errors from ZK

After this change:
- Errors written to /errors/topo-id/comp-name/eN, where N is a sequence 
number. (unchanged)
- Errors also written to /errors/topo-id/comp-name-last-error (this means 
there is an extra ZK write per error)
- New Nimbus thrift interface method `getTopologyInfoWithOpts`
  - Only option right now is how many errors to retrieve from ZK: 0, 1, or 
all.
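
As a rough illustration of the read pattern described above, the following Python sketch models ZooKeeper as a plain dictionary and counts reads for each option. Only the znode layout comes from the description; `FakeZk`, `report_error`, `get_errors`, `demo_reads`, and the string option names are hypothetical stand-ins, not Storm's actual implementation.

```python
class FakeZk:
    """Toy in-memory stand-in for ZooKeeper that counts read operations."""

    def __init__(self):
        self.nodes = {}   # znode path -> data
        self.reads = 0    # number of get/children calls so far

    def set(self, path, data):
        self.nodes[path] = data

    def get(self, path):
        self.reads += 1
        return self.nodes.get(path)

    def children(self, path):
        self.reads += 1
        prefix = path.rstrip("/") + "/"
        names = {p[len(prefix):].split("/")[0]
                 for p in self.nodes if p.startswith(prefix)}
        return sorted(names)


def report_error(zk, topo_id, comp, seq, message):
    """Write the sequenced error znode plus the extra last-error znode."""
    zk.set(f"/errors/{topo_id}/{comp}/e{seq:010d}", message)
    zk.set(f"/errors/{topo_id}/{comp}-last-error", message)


def get_errors(zk, topo_id, components, num_err_choice="ALL"):
    """Return {component: [errors]} for the choices ALL, ONE, or NONE."""
    errors = {}
    for comp in components:
        if num_err_choice == "NONE":
            errors[comp] = []                      # no ZK reads at all
        elif num_err_choice == "ONE":
            last = zk.get(f"/errors/{topo_id}/{comp}-last-error")
            errors[comp] = [] if last is None else [last]
        else:
            base = f"/errors/{topo_id}/{comp}"     # list children, then fetch each
            errors[comp] = [zk.get(f"{base}/{c}") for c in zk.children(base)]
    return errors


def demo_reads(num_components, errors_per_component, choice):
    """Populate a fake ZK and return how many reads one retrieval costs."""
    zk = FakeZk()
    comps = [f"bolt-{i}" for i in range(num_components)]
    for comp in comps:
        for seq in range(errors_per_component):
            report_error(zk, "topo-id", comp, seq, f"error {seq} from {comp}")
    get_errors(zk, "topo-id", comps, choice)
    return zk.reads
```

In this toy model, retrieving all errors costs one listing plus one read per stored error for every component, while retrieving only the last error costs a single read per component. It ignores network latency and ZK load, so it shows only the relative number of round trips.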

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/d2r/storm storm-636-ui-errors

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #392


commit 1cfa190f2efb06f8798984b43dec801e5ff20ad5
Author: Derek Dagit 
Date:   2015-01-22T16:46:03Z

Faster, optional retrieval of last component error




> UI/Monitor is slow for topologies with a large number of components
> ---
>
> Key: STORM-636
> URL: https://issues.apache.org/jira/browse/STORM-636
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.10.0
> Reporter: Derek Dagit
> Assignee: Derek Dagit
> Priority: Minor
>
> The getTopologyInfo method in nimbus fetches from ZK all errors reported by 
> all components.  This becomes too slow for topologies with a large number 
> of components (bolts/spouts).
> In one example, the UI consistently took over 5 minutes to load the topology 
> page for a topology with nearly 500 components while ZK was under load.
> Errors are currently stored in ZooKeeper under individual znodes per 
> component.  This means that each call to getTopologyInfo needs to list 
> children of each znode and then download the error znodes it finds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287803#comment-14287803
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user d2r commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71059798
  
For testing this I did the following:

I modified the ExclamationTopology as follows:
- Bolts report a new RuntimeException via reportError on each call to 
execute.
- Changed topology.max.error.report.per.interval to 100 (default 10), so each 
executor/thread will write up to 100 errors to ZK every 10 seconds.  This is a 
quick-and-dirty way of increasing write load on ZK for testing.
- Replicated the Exclamation bolt an additional ~450 times, so there are 
many more bolts in the topology.

Procedure:
- Launch the topology in the Inactive state and wait for it to be assigned 
and stable.
- Go to the topology page and activate the topology (at this point the load 
time is on the order of seconds).
- Refresh the topology page until fresh errors are reported for each bolt.
- Wait a minute.
- Refresh the page and record the load time from the Firefox network inspector.
- Repeat 4 more times.

Results:
  Before patch (median about a minute):
  - 57s
  - 1m13s
  - 55s
  - 51s
  - 1m34s

  With patch (median about 10s):
  - 12s
  - 6s
  - 8s
  - 13s
  - 10s
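
The medians quoted above can be checked with a few lines of Python. The numbers are the load times listed in the results, converted to seconds; the variable and function names are only illustrative.

```python
from statistics import median

# Load times from the results above, in seconds.
before_s = [57, 73, 55, 51, 94]   # 57s, 1m13s, 55s, 51s, 1m34s
after_s = [12, 6, 8, 13, 10]


def summarize(samples):
    """Return the median, fastest, and slowest load time of a run."""
    return {"median": median(samples), "min": min(samples), "max": max(samples)}


before = summarize(before_s)
after = summarize(after_s)
speedup = before["median"] / after["median"]

print(f"before: {before}")
print(f"after:  {after}")
print(f"median speedup: {speedup:.1f}x")
```

With only five samples per run this is a rough comparison, not a benchmark.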





[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287845#comment-14287845
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user Parth-Brahmbhatt commented on a diff in the pull request:

https://github.com/apache/storm/pull/392#discussion_r23393694
  
--- Diff: storm-core/src/storm.thrift ---
@@ -243,6 +243,16 @@ struct SubmitOptions {
   2: optional Credentials creds;
 }
 
+enum NumErrorsChoice {
+  ALL,
+  NONE,
+  ONE
+}
+
+struct GetInfoOptions {
+  1: optional NumErrorsChoice num_err_choice;
--- End diff --

Instead of an enum, don't you think a pagination struct with a start and end 
would be more flexible? 




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293882#comment-14293882
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user d2r commented on a diff in the pull request:

https://github.com/apache/storm/pull/392#discussion_r23627700
  
--- Diff: storm-core/src/storm.thrift ---
@@ -243,6 +243,16 @@ struct SubmitOptions {
   2: optional Credentials creds;
 }
 
+enum NumErrorsChoice {
+  ALL,
+  NONE,
+  ONE
+}
+
+struct GetInfoOptions {
+  1: optional NumErrorsChoice num_err_choice;
--- End diff --

I hadn't thought about it that way, but I'm open to it.  Could you give an 
example of what you mean?




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293889#comment-14293889
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user d2r commented on a diff in the pull request:

https://github.com/apache/storm/pull/392#discussion_r23628156
  
--- Diff: storm-core/src/storm.thrift ---
@@ -243,6 +243,16 @@ struct SubmitOptions {
   2: optional Credentials creds;
 }
 
+enum NumErrorsChoice {
+  ALL,
+  NONE,
+  ONE
+}
+
+struct GetInfoOptions {
+  1: optional NumErrorsChoice num_err_choice;
--- End diff --

I should add, the current ZK storage of errors isn't very flexible.  We 
store at most 10 errors per component (hard-coded).  With only 10 errors to 
work with, I am not sure how much more flexibility we get by doing pagination.  
I guess this hard-coded limit of 10 is why I did not pursue pagination—it is 
just easier to give all 10 errors in the case that we want more than 1.
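
To make the trade-off concrete, here is a small Python sketch comparing the enum from the diff with a pagination-style alternative. The pagination function and its half-open `[start, end)` semantics are assumptions made for illustration; no such struct was defined in this thread. The enum members mirror the order in the diff.

```python
from enum import Enum

MAX_ERRORS_PER_COMPONENT = 10  # hard-coded storage limit mentioned above


class NumErrorsChoice(Enum):
    """Mirrors the enum in the diff; values follow declaration order."""
    ALL = 0
    NONE = 1
    ONE = 2


def select_by_choice(errors, choice):
    """Select errors (assumed newest first) using the enum option."""
    if choice is NumErrorsChoice.NONE:
        return []
    if choice is NumErrorsChoice.ONE:
        return errors[:1]
    return list(errors)


def select_by_page(errors, start, end):
    """Hypothetical pagination alternative over the same stored errors."""
    return errors[start:end]


# At most 10 errors are stored per component, newest first in this sketch.
stored = [f"error-{i}" for i in range(MAX_ERRORS_PER_COMPONENT)]

# Each enum choice corresponds to one fixed page.
equivalent_pages = {
    NumErrorsChoice.NONE: (0, 0),
    NumErrorsChoice.ONE: (0, 1),
    NumErrorsChoice.ALL: (0, MAX_ERRORS_PER_COMPONENT),
}
```

Pagination can express every enum choice plus intermediate ranges, but with at most 10 stored errors the additional ranges are all small slices of the same short list.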




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293979#comment-14293979
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user Parth-Brahmbhatt commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71706395
  
I did not know we only stored 10 errors; it makes sense not to go with 
pagination if it's hard-coded to 10. I will take a look at the complete PR today.

Not to derail the discussion, but personally I would much rather not store 
errors in ZK at all if it's just for rendering the errors in the UI. If the 
spouts/bolts could just store this in memory with some expiration, that should 
suffice, and we could expose an API at the worker layer to get this information 
directly from it. If the host dies you lose some errors, but that does not seem 
like a big deal. The only downside is that the UI would now have to make 
requests against worker hosts to get errors, but that seems OK to me; you would 
also get parallelism, as all these worker calls can be made in parallel. I 
haven't thought this through completely and it's probably much more work, but I 
would love to hear your opinion.
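
A minimal Python sketch of the idea floated above, under stated assumptions: each worker keeps recent errors in memory with a time-to-live, and the UI side queries all workers in parallel. `WorkerErrorStore`, `fetch_all`, and the TTL default are hypothetical; a real version would call workers over the network rather than through direct method calls, and nothing like this exists in the patch under review.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class WorkerErrorStore:
    """Hypothetical per-worker, in-memory error buffer with expiration."""

    def __init__(self, ttl_secs=600, clock=time.monotonic):
        self.ttl_secs = ttl_secs
        self.clock = clock            # injectable so expiry can be tested
        self._lock = threading.Lock()
        self._errors = []             # list of (timestamp, component, message)

    def report(self, component, message):
        with self._lock:
            self._errors.append((self.clock(), component, message))

    def recent(self):
        """Drop expired entries and return the remaining (component, message) pairs."""
        cutoff = self.clock() - self.ttl_secs
        with self._lock:
            self._errors = [e for e in self._errors if e[0] >= cutoff]
            return [(comp, msg) for _, comp, msg in self._errors]


def fetch_all(workers, max_parallel=8):
    """Query every worker in parallel and merge results in worker order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        per_worker = list(pool.map(lambda w: w.recent(), workers))
    return [err for errs in per_worker for err in errs]
```

As noted in the comment, errors held this way are lost when the worker or its host goes away, which is the main trade-off against keeping them in ZK.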




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294071#comment-14294071
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user d2r commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71714975
  
> Not to derail the discussion, but personally I would much rather not store 
> errors in ZK at all if it's just for rendering the errors in the UI. If the 
> spouts/bolts could just store this in memory with some expiration, that should 
> suffice, and we could expose an API at the worker layer to get this information 
> directly from it. If the host dies you lose some errors, but that does not seem 
> like a big deal. The only downside is that the UI would now have to make 
> requests against worker hosts to get errors, but that seems OK to me; you would 
> also get parallelism, as all these worker calls can be made in parallel. I 
> haven't thought this through completely and it's probably much more work, but I 
> would love to hear your opinion.

Yeah, we were thinking about distributing things this way too.  We figured 
that the bigger problem is the heartbeats, and if we could get an improvement 
with less effort here, it would be worth it.  It would be a much bigger change 
to distribute the errors out of ZK, yet maybe it is not a bad idea.  (Also, I 
think it is good to persist the errors anyway, not just in memory.  Users would 
like to see errors on the UI even if there was some issue that brought the 
supervisor down—like a rolling upgrade of the cluster.)  Maybe we could file a 
JIRA for better gathering of errors.

This change was intended to be small in scope and just give a way to get 
errors more efficiently when a topology has many, many components.  It was 
prompted by seeing topology page load times of minutes from one of our 
customers.  Plus, this may be less of a problem once heartbeats (and their 
metrics) are no longer getting sent around, but it still may not be a bad idea 
to use a more distributed model like you suggest.





[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295345#comment-14295345
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71864014
  
I agree that ZK is not the ideal place to store most things, but it is by 
far the most convenient.  As such, unless it is shown that it is causing a 
significant load on ZK, I would rather leave it there until we can find/build a 
better place to put it.




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295348#comment-14295348
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71864756
  
The code looks fine to me, +1, although I want to wait for 
@Parth-Brahmbhatt to finish his review before merging anything in.




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295353#comment-14295353
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user d2r commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71865651
  
Agreed, I would like a review from @Parth-Brahmbhatt first if possible.




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-01-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295457#comment-14295457
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user Parth-Brahmbhatt commented on the pull request:

https://github.com/apache/storm/pull/392#issuecomment-71878844
  
I am +1 too, the code looks good to me.




[jira] [Commented] (STORM-636) UI/Monitor is slow for topologies with a large number of components

2015-02-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303754#comment-14303754
 ] 

ASF GitHub Bot commented on STORM-636:
--

Github user asfgit closed the pull request at:

https://github.com/apache/storm/pull/392

