[ https://issues.apache.org/jira/browse/STORM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294071#comment-14294071 ]

ASF GitHub Bot commented on STORM-636:
--------------------------------------

Github user d2r commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71714975
  
    > Not to derail the discussion, but personally I would much rather not 
store errors in ZK at all if it's just for rendering the errors in the UI.  If 
the spouts/bolts could just store this in memory with some expiration, that 
should suffice, and we could expose an API at the worker layer to get this 
information directly from it. If the host dies, you lose some errors, but that 
does not seem like a big deal. The only downside is the UI would now have to 
make requests against the worker hosts to get errors, but that seems OK to me; 
you would also get parallelism, as all these worker calls can be made in 
parallel. I haven't thought this through completely, and it's probably much 
more work, but I would love to hear your opinion.
    
    Yeah, we were thinking about distributing things this way too.  We figured 
that the bigger problem is the heartbeats, and if we could get an improvement 
here with less effort, it would be worth it.  It would be a much bigger change 
to distribute the errors out of ZK, but maybe it is not a bad idea.  (Also, I 
think it is good to persist the errors anyway, not just in memory.  Users would 
like to see errors on the UI even if some issue brought the supervisor down, 
such as a rolling upgrade of the cluster.)  Maybe we could file a JIRA for 
better gathering of errors.
    
    This change was intended to be small in scope and just give a way to get 
errors more efficiently when a topology has many, many components.  It was 
prompted by seeing topology page load times of minutes from one of our 
customers.  Plus, this may be less of a problem once heartbeats (and their 
metrics) are no longer getting sent around, but it still may not be a bad idea 
to use a more distributed model like you suggest.
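
    A minimal sketch of the kind of worker-local error buffer being suggested, 
assuming a simple TTL-based expiration; the class and method names here are 
hypothetical and not part of Storm's actual API:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedDeque;

    // Hypothetical sketch of a per-worker, in-memory error buffer with
    // time-based expiration. Names are illustrative, not Storm's actual API.
    public class InMemoryErrorStore {

        public static final class ComponentError {
            final String componentId;
            final String message;
            final long timestampMs;

            ComponentError(String componentId, String message, long timestampMs) {
                this.componentId = componentId;
                this.message = message;
                this.timestampMs = timestampMs;
            }
        }

        private final long ttlMs;
        private final Map<String, ConcurrentLinkedDeque<ComponentError>> errors =
                new ConcurrentHashMap<>();

        public InMemoryErrorStore(long ttlMs) {
            this.ttlMs = ttlMs;
        }

        // Record an error for a component; expired entries are pruned lazily.
        public void report(String componentId, String message) {
            long now = System.currentTimeMillis();
            errors.computeIfAbsent(componentId, k -> new ConcurrentLinkedDeque<>())
                  .add(new ComponentError(componentId, message, now));
            prune(componentId, now);
        }

        // Snapshot of unexpired errors, e.g. to serve from a worker-level
        // endpoint that the UI could query directly (and in parallel).
        public List<ComponentError> snapshot(String componentId) {
            long now = System.currentTimeMillis();
            prune(componentId, now);
            ConcurrentLinkedDeque<ComponentError> q = errors.get(componentId);
            return q == null ? Collections.emptyList() : new ArrayList<>(q);
        }

        private void prune(String componentId, long now) {
            ConcurrentLinkedDeque<ComponentError> q = errors.get(componentId);
            if (q == null) {
                return;
            }
            // Entries are appended in time order, so stop at the first live one.
            for (Iterator<ComponentError> it = q.iterator(); it.hasNext(); ) {
                if (now - it.next().timestampMs > ttlMs) {
                    it.remove();
                } else {
                    break;
                }
            }
        }
    }

    As the comment notes, this trades durability for speed: errors vanish if 
the host dies, which is why persisting them somewhere may still be worthwhile.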



> UI/Monitor is slow for topologies with a large number of components
> -------------------------------------------------------------------
>
>                 Key: STORM-636
>                 URL: https://issues.apache.org/jira/browse/STORM-636
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>            Reporter: Derek Dagit
>            Assignee: Derek Dagit
>            Priority: Minor
>
> The getTopologyInfo method in nimbus fetches from ZK all errors reported by 
> all components.  This becomes too slow for topologies with a large number of 
> components (bolts/spouts).
> In one example, the UI consistently took over 5 minutes to load the topology 
> page for a topology with nearly 500 components while ZK was under load.
> Errors are currently stored in ZooKeeper under individual znodes per 
> component.  This means that each call to getTopologyInfo needs to list 
> children of each znode and then download the error znodes it finds.
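
To illustrate the access pattern the issue describes, here is a rough sketch 
(not Storm's actual code; the znode layout shown is hypothetical) of why 
per-component error znodes make getTopologyInfo expensive: each component 
costs a getChildren round trip to ZK, plus one getData per error znode found.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Rough sketch of the read pattern described in the issue; the znode
    // layout /errors/<topology-id>/<component-id>/<error-id> is hypothetical.
    public class ErrorFetchSketch {

        static List<byte[]> fetchAllErrors(ZooKeeper zk, String topologyId,
                                           List<String> componentIds)
                throws KeeperException, InterruptedException {
            List<byte[]> all = new ArrayList<>();
            for (String component : componentIds) {
                String componentPath = "/errors/" + topologyId + "/" + component;
                // One ZK round trip per component just to list its error znodes...
                for (String errorNode : zk.getChildren(componentPath, false)) {
                    // ...plus another round trip to download each error znode.
                    all.add(zk.getData(componentPath + "/" + errorNode, false, null));
                }
            }
            // With ~500 components this is hundreds of serialized reads, which
            // under ZK load is consistent with the minutes-long page loads above.
            return all;
        }
    }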



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
