[ 
https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382295#comment-15382295
 ] 

ASF GitHub Bot commented on STORM-1977:
---------------------------------------

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/1574
  
    The code to do the leader election based off of the IDs was removed because 
architecturally the blobstore should be in charge of maintaining high 
availability.  Exposing what is stored on the local disk in an implementation 
detail and exposing that to nimbus breaks a lot of things.  Not to mention is 
not really possible with some blobstore implementations, like the HDFS backed 
one.
    
    To offset this we made sure that you could access the version of the blob 
even if it is stored on a different instance of the blobstore.
    
    Additionally the old code did not guarantee consistency either.  There was 
a timeout where if nimbus waited long enough to get a copy of all of the items 
it gave up and would still become leader.  If there really is no guarantee why 
add in the extra overhead?



> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more 
> replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: STORM-1977
>                 URL: https://issues.apache.org/jira/browse/STORM-1977
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Critical
>
> While investigating STORM-1976, I found that there're cases for nimbus to not 
> having topology codes. 
> Before BlobStore, only nimbuses which is having all topology codes can gain 
> leadership, otherwise they give up leadership immediately. While introducing 
> BlobStore, this logic is removed.
> I don't know it's intended or not, but it incurs one of nimbus to gain 
> leadership which doesn't have replicated topology code, and the nimbus will 
> be crashed when getClusterInfo is requested.
> Easiest way to reproduce is:
> 1. comment cleanup-corrupt-topologies! from nimbus.clj (It's a quick 
> workaround for resolving STORM-1976), and patch Storm cluster
> 2. Launch Nimbus 1 (leader)
> 3. Run topology
> 4. Kill Nimbus 1
> 5. Launch Nimbus 2 from different node
> 6. Nimbus 2 gains leadership 
> 7. getClusterInfo is requested to Nimbus 2, and Nimbus 2 gets crashed
> Log:
> {code}
> 2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob 
> store based in /grid/0/hadoop/storm/blobs
> ...
> 2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
> 2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
> ...
> 2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for 
> storm version '1.1.1-SNAPSHOT'
> 2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error 
> processing getClusterInfo
> KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
> ...
>         at 
> org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
>         at 
> org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
> ...
>         at 
> org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
>         at 
> org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
>         at 
> org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
>         at 
> org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
>         at 
> org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
> ...
> 2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download 
> blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
> KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
>         at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
>         at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
> ...
>        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>         at 
> org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
>         at 
> org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
>         at 
> org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
> ...
>         at 
> org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
>         at clojure.lang.RestFn.invoke(RestFn.java:410)
>         at 
> org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
>         at 
> org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
>         at 
> org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
>         at 
> org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
>         at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
> ...
> 2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when 
> processing an event")
> java.lang.RuntimeException: ("Error when processing an event")
>         at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
>         at clojure.lang.RestFn.invoke(RestFn.java:423)
>         at 
> org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
>         at 
> org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
>         at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
>         at clojure.lang.AFn.run(AFn.java:22)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to