Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2319#discussion_r138080750
  
    --- Diff: storm-core/src/clj/org/apache/storm/daemon/nimbus.clj ---
    @@ -961,14 +1015,25 @@
     ;; tasks figure out what tasks to talk to by looking at topology at runtime
         ;; only log/set when there's been a change to the assignment
         (doseq [[topology-id assignment] new-assignments
    -            :let [existing-assignment (get existing-assignments topology-id)
    -                  topology-details (.getById topologies topology-id)]]
    +            :let [existing-assignment (get existing-assignments topology-id)]]
           (if (= existing-assignment assignment)
             (log-debug "Assignment for " topology-id " hasn't changed")
             (do
               (log-message "Setting new assignment for topology id " topology-id ": " (pr-str assignment))
               (.set-assignment! storm-cluster-state topology-id assignment)
               )))
    +
    +    ;; grouping assignment by node to see the nodes diff, then notify nodes/supervisors to synchronize its owned assignment
    +    ;; because the number of existing assignments is small for every scheduling round,
    +    ;; we expect to notify supervisors at almost the same time.
    +    (->> new-assignments
    +         (map (fn [[tid new-assignment]]
    +           (let [existing-assignment (get existing-assignments tid)]
    +             (assignment-changed-nodes existing-assignment new-assignment))))
    +         (apply concat)
    +         (into #{})
    +         (notify-supervisors-assignments conf new-assignments))
    --- End diff ---
    
    I am very concerned about this.  One of our biggest issues is the timing of 
this loop.  It blocks scheduling a new topology, rebalancing a topology, and 
recovering from a failed worker or a failed supervisor.
    
    One of the key tenets of Hadoop has been to never let the main daemon reach out to the thousands of other processes that may be on the cluster, because if one of them blocks, or there are network issues, or whatever, it can slow down the rest of the cluster.
    
    I would really prefer to see a different model here.  At a minimum we need a thread pool that notifies the supervisors asynchronously, or even better, we never actually reach out to the supervisors at all.  We just set something up in memory on nimbus and have the supervisors poll it very frequently.  Nimbus is mostly single-threaded, so using more CPU is not a big deal.  We know that even 1,000 requests a second to the DRPC servers is not really any load at all, so polling a few times a second would not matter that much.
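
    To make the thread-pool suggestion concrete, below is a minimal sketch of how the notification could be pushed off the scheduling loop.  It is illustrative only, not Storm's actual API: `notify-supervisor-assignment` is a hypothetical stand-in for whatever call the PR uses to push an assignment to a single supervisor, and `changed-nodes` is the node set computed in the diff above.  The scheduling loop just submits work to the pool and moves on, so a slow or unreachable supervisor cannot block it.

        ;; minimal sketch, assuming a hypothetical notify-supervisor-assignment fn
        (import '[java.util.concurrent Executors ExecutorService])

        ;; fixed-size pool created once; the size 8 is arbitrary for illustration
        (defonce ^ExecutorService notify-pool (Executors/newFixedThreadPool 8))

        (defn notify-supervisors-async!
          "Push new assignments to each changed node without blocking the caller."
          [conf new-assignments changed-nodes notify-supervisor-assignment]
          (doseq [node changed-nodes]
            (let [^Runnable task
                  (fn []
                    (try
                      (notify-supervisor-assignment conf node new-assignments)
                      (catch Exception e
                        ;; a failed push is not fatal; the supervisor still syncs its
                        ;; assignment from ZooKeeper on its regular schedule
                        (log-error e "Failed to notify supervisor on node " node))))]
              (.submit notify-pool task))))

    Even with something like this, the push should stay best-effort, with the existing ZooKeeper-based sync as the source of truth, which is also why the polling model may be the simpler option.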

