[jira] [Commented] (STORM-1839) Kinesis Spout
[ https://issues.apache.org/jira/browse/STORM-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285208#comment-15285208 ] Erik Weathers commented on STORM-1839: -- Thanks [~sriharsha]! > Kinesis Spout > - > > Key: STORM-1839 > URL: https://issues.apache.org/jira/browse/STORM-1839 > Project: Apache Storm > Issue Type: Improvement > Reporter: Sriharsha Chintalapani > Assignee: Priyank Shah > > As Storm is increasingly used in cloud environments, it would be great to have a > Kinesis Spout integration in Apache Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1766) A better algorithm server rack selection for RAS
[ https://issues.apache.org/jira/browse/STORM-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1766: - Description: Currently the getBestClustering algorithm for RAS finds the "best" cluster/rack based on which rack has the most available resources. This may be insufficient, and may cause topologies not to be scheduled successfully even though there are enough resources in the cluster to schedule them. We attempt to find the rack with the most resources by finding the rack with the biggest sum of available memory + available cpu. This method is not effective since it does not consider the number of slots available. It also fails to identify racks that are unschedulable because one of the resources (memory, cpu, or slots) is exhausted. The current implementation also tries the initial scheduling on one rack only, and does not try to schedule on all the racks before giving up, which may cause topologies to fail to be scheduled because of the shortcomings described above. The current method also does not consider worker failures. When executors of a topology get unassigned and need to be scheduled again, the current logic in getBestClustering may be inadequate, if not completely wrong: when executors need to be rescheduled after a fault, getBestClustering will likely return a cluster different from the one where the majority of the topology's executors were originally scheduled. Thus, I propose a different strategy/algorithm to find the "best" cluster. I have come up with an ordering strategy I dub "subordinate resource availability ordering" (inspired by Dominant Resource Fairness) that sorts racks by the subordinate (not dominant) resource availability. 
For example, given 5 racks with the following resource availabilities:
{code}
// a rack that has a lot of memory but little cpu
rack-3 Avail [ CPU 100.0 MEM 20.0 Slots 40 ] Total [ CPU 100.0 MEM 20.0 Slots 40 ]
// supervisors that are depleted of one resource
rack-2 Avail [ CPU 0.0 MEM 8.0 Slots 40 ] Total [ CPU 0.0 MEM 8.0 Slots 40 ]
// a rack that has a lot of cpu but little memory
rack-4 Avail [ CPU 6100.0 MEM 1.0 Slots 40 ] Total [ CPU 6100.0 MEM 1.0 Slots 40 ]
// another rack of supervisors with fewer resources than rack-0
rack-1 Avail [ CPU 2000.0 MEM 4.0 Slots 40 ] Total [ CPU 2000.0 MEM 4.0 Slots 40 ]
rack-0 Avail [ CPU 4000.0 MEM 8.0 Slots 40 ] Total [ CPU 4000.0 MEM 8.0 Slots 40 ]
Cluster Overall Avail [ CPU 12200.0 MEM 41.0 Slots 200 ] Total [ CPU 12200.0 MEM 41.0 Slots 200 ]
{code}
It is clear that rack-0 is the best rack, since it is the most balanced and can potentially schedule the most executors, while rack-2 is the worst rack, since it is depleted of cpu and is thus unschedulable even though other resources are available. We first calculate the resource availability percentage of every rack for each resource by computing:
{code}
(resource available on rack) / (resource available in cluster)
{code}
We do this calculation to normalize the values; otherwise the resource values would not be comparable. 
So for our example:
{code}
rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] effective resources: 0.00819672131147541
rack-2 Avail [ CPU 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.0
rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective resources: 0.024390243902439025
rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] effective resources: 0.0975609756097561
rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.1951219512195122
{code}
The effective resource of a rack, which is also its subordinate resource, is computed as:
{code}
MIN(resource availability percentage of {CPU, Memory, # of free Slots})
{code}
Then we order the racks by effective resource. Thus for our example:
{code}
Sorted racks: [rack-0, rack-1, rack-4, rack-3, rack-2]
{code}
Also, to deal with the presence of failures: if a topology is partially scheduled, we find the rack with the most executors already scheduled for the topology and try to schedule on that rack first. Thus, when sorting racks, we first sort by the number of executors already scheduled on the rack, and then by the subordinate resource availability.
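The ordering strategy described above can be sketched in a few lines of Python (a hypothetical illustration only, not the actual RAS code; the `effective_resource` helper, the rack dictionaries, and the `scheduled_executors` map are assumptions for the example):

```python
def effective_resource(rack, cluster):
    """Subordinate resource: the minimum normalized availability across
    CPU, memory, and free slots (inspired by Dominant Resource Fairness)."""
    return min(rack[r] / cluster[r] for r in ("cpu", "mem", "slots"))

# Availabilities from the example above.
cluster = {"cpu": 12200.0, "mem": 41.0, "slots": 200}
racks = {
    "rack-0": {"cpu": 4000.0, "mem": 8.0,  "slots": 40},
    "rack-1": {"cpu": 2000.0, "mem": 4.0,  "slots": 40},
    "rack-2": {"cpu": 0.0,    "mem": 8.0,  "slots": 40},
    "rack-3": {"cpu": 100.0,  "mem": 20.0, "slots": 40},
    "rack-4": {"cpu": 6100.0, "mem": 1.0,  "slots": 40},
}

# For a partially scheduled topology, sort first by the number of executors
# already on the rack, then by subordinate resource availability (both
# descending). Here no executors are scheduled yet, so only the second key
# matters.
scheduled_executors = {name: 0 for name in racks}
ordered = sorted(
    racks,
    key=lambda name: (scheduled_executors[name],
                      effective_resource(racks[name], cluster)),
    reverse=True,
)
print(ordered)  # → ['rack-0', 'rack-1', 'rack-4', 'rack-3', 'rack-2']
```

This reproduces the sorted order given in the description: rack-2's effective resource is 0.0 because its CPU is exhausted, so it correctly lands last even though it still has memory and slots free.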
[jira] [Resolved] (STORM-143) Launching a process throws away standard out; can hang
[ https://issues.apache.org/jira/browse/STORM-143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers resolved STORM-143. - Resolution: Fixed Fix Version/s: 0.10.0 > Launching a process throws away standard out; can hang > -- > > Key: STORM-143 > URL: https://issues.apache.org/jira/browse/STORM-143 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: James Xu > Priority: Minor > Fix For: 0.10.0 > > > https://github.com/nathanmarz/storm/issues/489 > https://github.com/nathanmarz/storm/blob/master/src/clj/backtype/storm/util.clj#L349 > When we launch a process, standard out is written to a system buffer and does > not appear to be read. Also, nothing is redirected to standard in. This can > have the following effects: > A worker can hang when initializing (e.g. UnsatisfiedLinkError looking for > jzmq), and it will be unable to communicate the error as standard out is > being swallowed. > A process that writes too much to standard out will block if the buffer fills. > A process that tries to read from standard in for any reason will block. > Perhaps we can redirect standard out to an .out file, and redirect /dev/null > to the standard in stream of the process? > -- > nathanmarz: Storm redirects stdout to the logging system. It's worked fine > for us in our topologies. > -- > d2r: We see in worker.clj, in mk-worker, where there is a call to > redirect-stdio-to-slf4j!. This would not seem to help in cases such as we are > seeing when there is a problem launching the worker itself. > (defn -main [storm-id assignment-id port-str worker-id] > (let [conf1 (read-storm-config) > login_conf_file (System/getProperty "java.security.auth.login.config") > conf (if login_conf_file (merge conf1 > {"java.security.auth.login.config" login_conf_file}) conf1)] > (validate-distributed-mode! conf) > (mk-worker conf nil (java.net.URLDecoder/decode storm-id) assignment-id > (Integer/parseInt port-str) worker-id))) > If anything were to go wrong (CLASSPATH, jvm opts, misconfiguration...) > before -main or before mk-worker, then any output would be lost. The symptom > we saw was that the topology sat around apparently doing nothing, yet there > was no log indicating that the workers were failing to start. > Is there other redirection to logs that I'm missing? > -- > xiaokang: we use bash to launch the worker process and redirect its stdout to a > worker-port.out file. It helped us find the zeromq JNI problem that caused the > JVM to crash without any log. > -- > nathanmarz: @d2r Yea, that's all I was referring to. If we redirect stdout, > will the code that redirects stdout to the logging system still take effect? > This is important because we can control the size of the logfiles (via the > logback config) but not the size of the redirected stdout file. > -- > d2r: My hunch is that it will work as it does now, except that any messages > that are getting thrown away before that point would go to a file instead. I > can play with it and find out. We wouldn't want to change the redirection, > just restore visibility to any output that might occur prior to the > redirection. There should be some safety valve to control the size of any new > .out in case something goes berserk. > @xiaokang I see how that would work. We also need to make sure redirection > continues to work as it currently does for the above reason. > -- > xiaokang: @d2r @nathanmarz In our cluster, storm's stdout redirection still > works for any System.out output while JNI errors go to the worker-port.out > file. I think it will be nice to use the same worker-port.log file for bash > stdout redirection since logback can control log file size. But it is a > little bit ugly to use bash to launch the worker java process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
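The redirection suggested in the ticket above (stdout captured in an `.out` file, `/dev/null` wired to stdin so the child can never block on a read) can be sketched in Python; the command, file names, and `launch_with_out_file` helper here are illustrative, not Storm's actual launcher code:

```python
import os
import subprocess
import sys
import tempfile

def launch_with_out_file(cmd, out_path):
    """Launch a child process, capturing stdout/stderr in an .out file and
    wiring stdin to /dev/null so reads from stdin never block."""
    with open(out_path, "ab") as out, open(os.devnull, "rb") as devnull:
        return subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT,
                                stdin=devnull)

# Anything the child prints before (or instead of) setting up its own
# logging lands in the .out file rather than being swallowed.
out_file = os.path.join(tempfile.mkdtemp(), "worker-6700.out")
proc = launch_with_out_file(
    [sys.executable, "-c", "print('early startup message')"], out_file)
proc.wait()
print(open(out_file).read().strip())  # → early startup message
```

As the thread notes, a real deployment would still want a safety valve on the `.out` file's size, since (unlike the logback-managed logs) nothing rotates it.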
[jira] [Commented] (STORM-143) Launching a process throws away standard out; can hang
[ https://issues.apache.org/jira/browse/STORM-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261323#comment-15261323 ] Erik Weathers commented on STORM-143: - Aha (hadn't clicked the {{...}} on the GitHub UI)! I'll mark this ticket as closed then, thanks! > Launching a process throws away standard out; can hang > -- > > Key: STORM-143 > URL: https://issues.apache.org/jira/browse/STORM-143 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: James Xu > Priority: Minor > > [...] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1733) Logs from bin/storm are lost because stdout and stderr are not flushed
[ https://issues.apache.org/jira/browse/STORM-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261166#comment-15261166 ] Erik Weathers commented on STORM-1733: -- [gigantic auto-comment above|https://issues.apache.org/jira/browse/STORM-1733?focusedCommentId=15261156&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15261156] is an example of why I want to disable the automatic uploading of all GitHub stuff into JIRA. > Logs from bin/storm are lost because stdout and stderr are not flushed > -- > > Key: STORM-1733 > URL: https://issues.apache.org/jira/browse/STORM-1733 > Project: Apache Storm > Issue Type: Bug > Affects Versions: 0.9.3, 0.10.0, 0.9.4, 0.9.5, 0.9.6 > Reporter: Karthick Duraisamy Soundararaj > Assignee: Karthick Duraisamy Soundararaj > > bin/storm.py emits the following crucial information, which is lost because we > don't flush stdout before exec: > {code} > 2016-04-25T08:23:43.17141 Running: java -server -Dstorm.options= > -Dstorm.home= -Xmx1024m -Dlogfile.name=nimbus.log > -Dlogback.configurationFile=logback/cluster.xml backtype.storm.ui.core.nimbus > {code} > Observed Environment: > {code} > OS: CentOS release 6.5 > Kernel: 2.6.32-431.el6.x86_64 > Python version: Python 2.7.2 > {code} > For example, I am using runit to start storm components like nimbus, ui, etc., and > the problem applies to all the components: in all cases, I am > not seeing the logs that are emitted by bin/storm before {{os.execvp}} is called > to actually launch the component. > Please note that in cases where stdout and stderr are a terminal, they > are always flushed and the bug does not apply. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
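The fix this ticket describes amounts to flushing the buffered streams before `os.execvp` replaces the process image. A minimal sketch of the pattern (the `exec_storm_class` name and the command are illustrative, not the exact bin/storm.py code):

```python
import os
import sys

def exec_storm_class(command):
    """Print the command being run, then exec it. Without the flush calls,
    the buffered "Running: ..." line is lost when execvp replaces the
    process image -- e.g. when stdout is a pipe (runit) rather than a
    terminal, where it would be line-buffered and flushed automatically."""
    print("Running: " + " ".join(command))
    sys.stdout.flush()
    sys.stderr.flush()
    os.execvp(command[0], command)
```

Terminal-attached runs mask the bug because line buffering flushes on every newline; only redirected/piped stdout exhibits the loss, which matches the "Please note" caveat in the ticket.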
[jira] [Commented] (STORM-143) Launching a process throws away standard out; can hang
[ https://issues.apache.org/jira/browse/STORM-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259806#comment-15259806 ] Erik Weathers commented on STORM-143: - [~revans2]: seems this issue is fixed with the LogWriter that was introduced in storm-0.10.0. I cannot find a ticket for that feature to link this against, though. > Launching a process throws away standard out; can hang > -- > > Key: STORM-143 > URL: https://issues.apache.org/jira/browse/STORM-143 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: James Xu > Priority: Minor > > [...] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-954) Topology Event Inspector
[ https://issues.apache.org/jira/browse/STORM-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-954: Summary: Topology Event Inspector (was: Toplogy Event Inspector) > Topology Event Inspector > > > Key: STORM-954 > URL: https://issues.apache.org/jira/browse/STORM-954 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Reporter: Sriharsha Chintalapani > Assignee: Arun Mahadevan > Fix For: 1.0.0 > > > • Ability to view tuples flowing through the topology > • Ability to turn debug events on/off without having to stop/restart the topology > • Debug events are off by default > • User should be able to select a specific Spout or Bolt and see its incoming > and outgoing events > • We could make the number of events to view configurable (e.g. last 100 events > or last 1 minute) > • Tuple stream to have the following info: > • Message id, batch/transaction id, name/value pairs, timestamp, acked (boolean) > • All the above to be available from the Storm UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
[ https://issues.apache.org/jira/browse/STORM-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209115#comment-15209115 ] Erik Weathers commented on STORM-1056: -- [~kabhwan]: ahh, seems that [the release notes for storm 0.10.0|https://storm.apache.org/2015/11/05/storm0100-released.html] were just missing STORM-1056, but it's actually present in v0.10.0: * https://github.com/apache/storm/blob/v0.10.0/bin/storm.py#L80 And in the binary release tarball:
{code}
(/tmp) % wget http://www.carfab.com/apachesoftware/storm/apache-storm-0.10.0/apache-storm-0.10.0.tar.gz
...
(/tmp) % tar -xf apache-storm-0.10.0.tar.gz
(/tmp/apache-storm-0.10.0) % grep SUPERVI bin/storm.py
STORM_SUPERVISOR_LOG_FILE = os.getenv('STORM_SUPERVISOR_LOG_FILE', "supervisor.log")
"-Dlogfile.name=" + STORM_SUPERVISOR_LOG_FILE,
{code}
> allow supervisor log filename to be configurable via ENV variable > - > > Key: STORM-1056 > URL: https://issues.apache.org/jira/browse/STORM-1056 > Project: Apache Storm > Issue Type: Task > Components: storm-core > Reporter: Erik Weathers > Assignee: Erik Weathers > Priority: Minor > Fix For: 0.9.6 > > > *Requested feature:* allow configuring the supervisor's log filename when > launching it via an ENV variable. > *Motivation:* The storm-on-mesos project (https://github.com/mesos/storm) > relies on multiple Storm Supervisor processes per worker host, where each > Supervisor is dedicated to a particular topology. This is part of the > framework's functionality of separating topologies from each other. i.e., > storm-on-mesos is a multi-tenant system. But before the change requested in > this issue, the logs from all supervisors on a worker host will be written > into a supervisor log with a single name of supervisor.log. If all logs are > written to a common location on the mesos host, then all logs go to the same > log file. 
Instead it would be desirable to separate the supervisor logs > per-topology, so that each tenant/topology-owner can peruse the logs that are > related to their own topology. Thus this ticket is requesting the ability to > configure the supervisor log via an environment variable whilst invoking > bin/storm.py (or bin/storm in pre-0.10 storm releases). > When this ticket is fixed, we will include the topology ID into the > supervisor log filename for storm-on-mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
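The mechanism behind the fix is just an environment-variable lookup with a default, as the grep output quoted above shows. A small self-contained sketch of the pattern (the `supervisor_logfile_opt` helper and the topology-specific filename are illustrative):

```python
import os

def supervisor_logfile_opt():
    """Build the JVM logfile option the way the STORM-1056 change does:
    fall back to "supervisor.log" unless overridden via the environment."""
    log_file = os.getenv("STORM_SUPERVISOR_LOG_FILE", "supervisor.log")
    return "-Dlogfile.name=" + log_file

# e.g. storm-on-mesos can embed the topology ID in the filename before
# invoking bin/storm.py:
os.environ["STORM_SUPERVISOR_LOG_FILE"] = "supervisor-mytopology-1.log"
print(supervisor_logfile_opt())  # → -Dlogfile.name=supervisor-mytopology-1.log
```

With the variable unset, the helper yields the old single-name behavior (`-Dlogfile.name=supervisor.log`), so the change is backward compatible.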
[jira] [Updated] (STORM-1631) storm CGroup bugs 1) when launching workers as the user that submitted the topology 2) when initial cleanup of cgroup fails
[ https://issues.apache.org/jira/browse/STORM-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1631: - Summary: storm CGroup bugs 1) when launching workers as the user that submitted the topology 2) when initial cleanup of cgroup fails (was: torm CGroup bugs 1) when launching workers as the user that submitted the topology 2) when initial cleanup of cgroup fails) > storm CGroup bugs 1) when launching workers as the user that submitted the > topology 2) when initial cleanup of cgroup fails > --- > > Key: STORM-1631 > URL: https://issues.apache.org/jira/browse/STORM-1631 > Project: Apache Storm > Issue Type: Bug >Reporter: Boyang Jerry Peng >Assignee: Boyang Jerry Peng > > In secure multitenant storm, topology workers are launched with permission of > the user that submitted the topology. This causes a problem with cgroups > since workers are launched with permissions of the topology user which does > not have permissions to modify cgroups storm is using > Also, the clean up code is not trying to clean up cgroups of killed workers > if the initial attempt failed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1342) support multiple logviewers per host for container-isolated worker logs
[ https://issues.apache.org/jira/browse/STORM-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118650#comment-15118650 ] Erik Weathers commented on STORM-1342: -- STORM-1494 is adding support for the supervisor logs to be linked from the Nimbus UI. So this will likely be another area to adjust when (if!?) this is fixed. > support multiple logviewers per host for container-isolated worker logs > --- > > Key: STORM-1342 > URL: https://issues.apache.org/jira/browse/STORM-1342 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Reporter: Erik Weathers > Priority: Minor > > h3. Storm-on-Mesos Worker Logs are in varying directories > When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each > topology's workers are isolated into separate containers. By default the > worker logs are saved into container-specific sandbox directories. These > directories are also topology-specific by definition, because, as just > stated, the containers are specific to each topology. > h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host > A challenge with this different way of running Storm is that the [Storm > logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] > runs as a single instance on each worker host. This doesn't play well with > having the topology worker logs in separate per-topology containers. The one > logviewer doesn't know about the various sandbox directories that the Storm > Workers are writing to. And if we just spawned new logviewers for each > container, the problem is that the Storm UI only knows about one global port for > the logviewer, so you cannot simply direct users to the right per-topology logviewer. > These problems are documented (or linked to) from [Issue #6 in the > storm-on-mesos project|https://github.com/mesos/storm/issues/6] > h3. Possible Solutions I can envision > # configure the Storm workers to write to log directories that exist on the > raw host outside of the container sandbox, and run a single logviewer on each > host, which serves up the contents of that directory. > #* violates one of the basic reasons for using containers: isolation. > #* also prevents a standard use case for Mesos: running more than one > instance of a Mesos Framework (e.g., "Storm Cluster") at once on the same Mesos > Cluster, e.g. for Blue-Green deployments. > #* a variation on this proposal is to somehow expose the sandbox dirs of all > storm containers to this singleton logviewer process (still has the above > problems) > # launch a separate logviewer in each container, and somehow register those > logviewers with Storm such that Storm knows, for a given host, which logviewer > port is assigned to a given topology. > #* this is the proposed solution > h3. Storm Changes for the Proposed Solution > Nimbus or ZooKeeper could serve as a registrar, recording the association > between a slot (host + worker port) and the logviewer port that is serving > the worker's logs. And the Storm-on-Mesos framework could update this registry > when launching a new worker. (This proposal definitely calls for thorough > vetting and thinking.) > h3. Storm-on-Mesos Framework Changes for the Proposed Solution > Along with the interaction with the "registrar" proposed above, the > storm-on-mesos framework can be enhanced to launch multiple logviewers on a > given worker host, where each logviewer is dedicated to serving the worker > logs from a specific topology's container/sandbox directory. This would be > done by launching a logviewer process within the topology's container, and > assigning it an arbitrary listening port that has been determined dynamically > through mesos (which treats ports as one of the schedulable resource > primitives of a worker host). 
[Code implementing this > logviewer-port-allocation logic already > exists|https://github.com/mesos/storm/commit/af8c49beac04b530c33c1401c829caaa8e368a35], > but [that specific portion of the code was > reverted|https://github.com/mesos/storm/commit/dc3eee0f0e9c06f6da7b2fe697a8e4fc05b5227e] > because of the issues that inspired this ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
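As a rough illustration of the "registrar" idea described in the ticket above, the sketch below shows a minimal in-memory mapping from a worker slot (host + worker port) to its logviewer port. All names here are invented for illustration; a real implementation would persist the mapping in ZooKeeper or Nimbus rather than in process memory.

```python
# Hypothetical sketch of the proposed slot -> logviewer-port registry.
# Names are invented; a real version would live in ZooKeeper or Nimbus.

class LogviewerRegistry:
    def __init__(self):
        # (hostname, worker_port) -> logviewer_port
        self._slots = {}

    def register(self, host, worker_port, logviewer_port):
        # Called by the framework when it launches a worker+logviewer pair;
        # the logviewer port would come from the Mesos port resources.
        self._slots[(host, worker_port)] = logviewer_port

    def lookup(self, host, worker_port):
        # Called by the Storm UI to build a log link for a given slot;
        # returns None when no logviewer is registered for the slot.
        return self._slots.get((host, worker_port))

registry = LogviewerRegistry()
registry.register("host_19", 5700, 31452)  # 31452: a dynamically assigned port
print(registry.lookup("host_19", 5700))    # -> 31452
```

The point of the lookup-by-slot interface is that the UI already knows the host and worker port for every executor, so no other key is needed to find the right logviewer.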
[jira] [Commented] (STORM-1141) Maven Central does not have 0.10.0 libraries
[ https://issues.apache.org/jira/browse/STORM-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111524#comment-15111524 ] Erik Weathers commented on STORM-1141: -- [~cburch]: 0.10.x and 0.10.0 don't have {{ClusterSummary.nimbuses}}: * https://github.com/apache/storm/blob/v0.10.0/storm-core/src/jvm/backtype/storm/generated/ClusterSummary.java#L68-L70 * https://github.com/apache/storm/blob/0.10.x-branch/storm-core/src/jvm/backtype/storm/generated/ClusterSummary.java#L68-L70 That field [landed into master|https://github.com/apache/storm/commit/4502bffbe3f9b4cd3674a56afbda1bb115cec239] and wasn't put into 0.10.0. I believe it's part of the HA Nimbus support that is in 0.11.x. > Maven Central does not have 0.10.0 libraries > > > Key: STORM-1141 > URL: https://issues.apache.org/jira/browse/STORM-1141 > Project: Apache Storm > Issue Type: Bug >Reporter: caleb burch >Assignee: P. Taylor Goetz >Priority: Blocker > Fix For: 0.10.0 > > > HDP has moved to 2.3 that features Storm 0.10.0. The current storm-core jars > on maven central are back at 0.9.5 and the beta 0.10.0 drivers aren't up to > date. (They lack the list of nimbus nodes so fail with a > "nimbus.uptime.secs" not set error when attempting to get ClusterInfo via the > java client). > Any chance the latest 0.10.x build can be pushed to maven, or a timeframe of > when you expect to do it? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-822) As a storm developer I’d like to use the new kafka consumer API (0.8.3) to reduce dependencies and use long term supported kafka apis
[ https://issues.apache.org/jira/browse/STORM-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107726#comment-15107726 ] Erik Weathers commented on STORM-822: - [~DeepNekro]: can you please comment on whether your work directly overlaps with STORM-1015? > As a storm developer I’d like to use the new kafka consumer API (0.8.3) to > reduce dependencies and use long term supported kafka apis > -- > > Key: STORM-822 > URL: https://issues.apache.org/jira/browse/STORM-822 > Project: Apache Storm > Issue Type: Story > Components: storm-kafka >Reporter: Thomas Becker >Assignee: Hugo Louro > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1342) support multiple logviewers per host for container-isolated worker logs
[ https://issues.apache.org/jira/browse/STORM-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1342: - Description: h3. Storm-on-Mesos Worker Logs are in varying directories When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each topology's workers are isolated into separate containers. By default the worker logs will be saved into container-specific sandbox directories. These directories are also topology-specific by definition, because, as just stated, the containers are specific to each topology. h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host A challenge with this different way of running Storm is that the [Storm logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] runs as a single instance on each worker host. This doesn't play well with having the topology worker logs in separate per-topology containers. The one logviewer doesn't know about the various sandbox directories that the Storm Workers are writing to. And if we just spawned new logviewers for each container, the problem is that the Storm UI only knows about 1 global port for the logviewer, so you cannot simply direct users to the correct per-container logviewer. These problems are documented (or linked to) from [Issue #6 in the storm-on-mesos project|https://github.com/mesos/storm/issues/6] h3. Possible Solutions I can envision # configure the Storm workers to write to log directories that exist on the raw host outside of the container sandbox, and run a single logviewer on a host, which serves up the contents of that directory. #* violates one of the basic reasons for using containers: isolation. #* also prevents a standard use case for Mesos: running more than 1 instance of a Mesos Framework (e.g., "Storm Cluster") at once on the same Mesos Cluster, e.g., for Blue-Green deployments. 
#* a variation on this proposal is to somehow expose the sandbox dirs of all storm containers to this singleton logviewer process (still has the above problems) # launch a separate logviewer in each container, and somehow register those logviewers with Storm such that Storm knows, for a given host, which logviewer port is assigned to a given topology. #* this is the proposed solution h3. Storm Changes for the Proposed Solution Nimbus or ZooKeeper could serve as a registrar, recording the association between a slot (host + worker port) and the logviewer port that is serving that worker's logs. And the Storm-on-Mesos framework could update this registry when launching a new worker. (This proposal definitely calls for thorough vetting and thinking.) h3. Storm-on-Mesos Framework Changes for the Proposed Solution Along with the interaction with the "registrar" proposed above, the storm-on-mesos framework can be enhanced to launch multiple logviewers on a given worker host, where each logviewer is dedicated to serving the worker logs from a specific topology's container/sandbox directory. This would be done by launching a logviewer process within the topology's container, and assigning it an arbitrary listening port that has been determined dynamically through mesos (which treats ports as one of the schedulable resource primitives of a worker host). [Code implementing this logviewer-port-allocation logic already exists|https://github.com/mesos/storm/commit/af8c49beac04b530c33c1401c829caaa8e368a35], but [that specific portion of the code was reverted|https://github.com/mesos/storm/commit/dc3eee0f0e9c06f6da7b2fe697a8e4fc05b5227e] because of the issues that inspired this ticket. was: h3. Storm-on-Mesos Worker Logs are in varying directories When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each topology's workers are isolated into separate containers. By default the worker logs will be saved into container-specific sandbox directories. 
These directories are also topology-specific by definition, because, as just stated, the containers are specific to each topology. h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host A challenge with this different way of running Storm is that the [Storm logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] runs as a single instance on each worker host. This doesn't play well with having the topology worker logs in separate per-topology containers. The one logviewer doesn't know about the various sandbox directories that the Storm Workers are writing to. And if we just spawned new logviewers for each container, the problem is that the Storm UI only knows about 1 global port the logviewer, so you cannot just direct. h3. Possible Solutions I can envision # configure the Storm workers to write to log directories that exist on the raw host outside of the container
[jira] [Created] (STORM-1342) support multiple logviewers per host for container-isolated worker logs
Erik Weathers created STORM-1342: Summary: support multiple logviewers per host for container-isolated worker logs Key: STORM-1342 URL: https://issues.apache.org/jira/browse/STORM-1342 Project: Apache Storm Issue Type: Improvement Components: storm-core Reporter: Erik Weathers Priority: Minor h3. Storm-on-Mesos Worker Logs are in varying directories When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each topology's workers are isolated into separate containers. By default the worker logs will be saved into container-specific sandbox directories. These directories are also topology-specific by definition, because, as just stated, the containers are specific to each topology. h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host A challenge with this different way of running Storm is that the [Storm logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] runs as a single instance on each worker host. This doesn't play well with having the topology worker logs in separate per-topology containers. The one logviewer doesn't know about the various sandbox directories that the Storm Workers are writing to. And if we just spawned new logviewers for each container, the problem is that the Storm UI only knows about 1 global port for the logviewer, so you cannot simply direct users to the correct per-container logviewer. h3. Possible Solutions I can envision # configure the Storm workers to write to log directories that exist on the raw host outside of the container sandbox, and run a single logviewer on a host, which serves up the contents of that directory. #* violates one of the basic reasons for using containers: isolation. #* also prevents a standard use case for Mesos: running more than 1 instance of a Mesos Framework (e.g., "Storm Cluster") at once on the same Mesos Cluster, e.g., for Blue-Green deployments. 
#* a variation on this proposal is to somehow expose the sandbox dirs of all storm containers to this singleton logviewer process (still has the above problems) # launch a separate logviewer in each container, and somehow register those logviewers with Storm such that Storm knows, for a given host, which logviewer port is assigned to a given topology. #* this is the proposed solution h3. Storm Changes for the Proposed Solution Nimbus or ZooKeeper could serve as a registrar, recording the association between a slot (host + worker port) and the logviewer port that is serving that worker's logs. And the Storm-on-Mesos framework could update this registry when launching a new worker. (This proposal definitely calls for thorough vetting and thinking.) h3. Storm-on-Mesos Framework Changes for the Proposed Solution Along with the interaction with the "registrar" proposed above, the storm-on-mesos framework can be enhanced to launch multiple logviewers on a given worker host, where each logviewer is dedicated to serving the worker logs from a specific topology's container/sandbox directory. This would be done by launching a logviewer process within the topology's container, and assigning it an arbitrary listening port that has been determined dynamically through mesos (which treats ports as one of the schedulable resource primitives of a worker host). [Code implementing this logviewer-port-allocation logic already exists|https://github.com/mesos/storm/commit/af8c49beac04b530c33c1401c829caaa8e368a35], but [that specific portion of the code was reverted|https://github.com/mesos/storm/commit/dc3eee0f0e9c06f6da7b2fe697a8e4fc05b5227e] because of the issues that inspired this ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-1216) button to kill all topologies in Storm UI
Erik Weathers created STORM-1216: Summary: button to kill all topologies in Storm UI Key: STORM-1216 URL: https://issues.apache.org/jira/browse/STORM-1216 Project: Apache Storm Issue Type: Wish Components: storm-core Affects Versions: 0.11.0 Reporter: Erik Weathers Priority: Minor In the Storm-on-Mesos project we had a [request to have an ability to "shut down the storm cluster" via a UI button|https://github.com/mesos/storm/issues/46]. That could be accomplished via a button in the Storm UI to kill all of the topologies. I understand if this is viewed as an undesirable feature, but I just wanted to document the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
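Until (and unless) such a button exists, a "kill everything" operation could be scripted around the existing `storm` CLI: list the topologies, then kill each by name. The sketch below is illustrative; the sample `storm list` output format is invented here and differs across Storm versions, so the parsing would need adjusting.

```python
# Hypothetical "kill all topologies" built on the storm CLI.
# SAMPLE_LISTING is a made-up approximation of `storm list` output.
import subprocess

SAMPLE_LISTING = """\
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
wordcount            ACTIVE     28         3            21920
clickstream          ACTIVE     14         2            1045
"""

def topology_names(listing):
    # Skip the header row and the dashed separator; take the first column.
    lines = listing.splitlines()
    return [line.split()[0] for line in lines[2:] if line.strip()]

def kill_all(listing):
    for name in topology_names(listing):
        # -w 0 skips the usual wait period; omit it for a graceful drain.
        subprocess.run(["storm", "kill", name, "-w", "0"], check=True)

print(topology_names(SAMPLE_LISTING))  # -> ['wordcount', 'clickstream']
```

A UI button would presumably do the same thing through the Nimbus Thrift API rather than by shelling out, but the CLI loop covers the "shut down the storm cluster" request in the meantime.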
[jira] [Updated] (STORM-1027) Topology may hang because metric-tick function is a blocking call from spout
[ https://issues.apache.org/jira/browse/STORM-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1027: - Affects Version/s: 0.9.5 Fix Version/s: 0.9.6 > Topology may hang because metric-tick function is a blocking call from spout > > > Key: STORM-1027 > URL: https://issues.apache.org/jira/browse/STORM-1027 > Project: Apache Storm > Issue Type: Bug > Components: storm-core >Affects Versions: 0.10.0, 0.9.5 >Reporter: Abhishek Agarwal >Assignee: Abhishek Agarwal >Priority: Critical > Fix For: 0.10.0, 0.9.6 > > > Nathan had fixed the dining philosopher problem by putting an overflow buffer > in the spout so that the spout is not blocking. However, the overflow buffer is not > used when emitting metrics, and that could result in a deadlock. I > modified the executor to use the overflow buffer for emitting metrics, and > afterwards the topology didn't hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
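The overflow-buffer fix described above can be sketched in miniature: instead of blocking when the bounded transfer queue is full (which is what can deadlock the spout), the emitter spills into an unbounded overflow buffer that is drained opportunistically later. This is an illustration of the technique, not Storm's actual executor code.

```python
# Minimal sketch of the overflow-buffer idea: never block on emit;
# spill to an unbounded buffer when the transfer queue is full.
from collections import deque
from queue import Queue, Full

transfer_queue = Queue(maxsize=2)  # stands in for the bounded transfer queue
overflow = deque()                 # unbounded overflow buffer

def emit(tuple_):
    try:
        transfer_queue.put_nowait(tuple_)  # non-blocking: cannot deadlock
    except Full:
        overflow.append(tuple_)

def drain_overflow():
    # Move spilled tuples into the queue as space frees up.
    while overflow:
        try:
            transfer_queue.put_nowait(overflow[0])
        except Full:
            break  # still full; retry on the next tick
        overflow.popleft()

for t in ["m1", "m2", "m3", "m4"]:
    emit(t)
print(transfer_queue.qsize(), len(overflow))  # -> 2 2
```

Because metric emissions go through the same non-blocking path as regular tuples, the spout thread can always make progress, which is exactly what breaks the circular-wait condition in the dining-philosophers-style hang.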
[jira] [Commented] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
[ https://issues.apache.org/jira/browse/STORM-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994381#comment-14994381 ] Erik Weathers commented on STORM-1056: -- [~kabhwan]: seems this didn't get put into the 0.10.0 release as I had expected. :-( Can you please ensure it's in the train for 0.10.1? > allow supervisor log filename to be configurable via ENV variable > - > > Key: STORM-1056 > URL: https://issues.apache.org/jira/browse/STORM-1056 > Project: Apache Storm > Issue Type: Task > Components: storm-core >Reporter: Erik Weathers >Assignee: Erik Weathers >Priority: Minor > Fix For: 0.9.6 > > > *Requested feature:* allow configuring the supervisor's log filename when > launching it via an ENV variable. > *Motivation:* The storm-on-mesos project (https://github.com/mesos/storm) > relies on multiple Storm Supervisor processes per worker host, where each > Supervisor is dedicated to a particular topology. This is part of the > framework's functionality of separating topologies from each other. i.e., > storm-on-mesos is a multi-tenant system. But before the change requested in > this issue, the logs from all supervisors on a worker host will be written > into a supervisor log with a single name of supervisor.log. If all logs are > written to a common location on the mesos host, then all logs go to the same > log file. Instead it would be desirable to separate the supervisor logs > per-topology, so that each tenant/topology-owner can peruse the logs that are > related to their own topology. Thus this ticket is requesting the ability to > configure the supervisor log via an environment variable whilst invoking > bin/storm.py (or bin/storm in pre-0.10 storm releases). > When this ticket is fixed, we will include the topology ID into the > supervisor log filename for storm-on-mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
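The requested behavior can be sketched as a small filename-selection helper such as a launcher like bin/storm.py might use. The variable name `STORM_SUPERVISOR_LOG_FILE` and the helper itself are invented for illustration; the actual patch may use different names.

```python
# Sketch of env-var-configurable supervisor log filename.
# STORM_SUPERVISOR_LOG_FILE is a hypothetical variable name.
import os

def supervisor_log_filename(topology_id=None):
    override = os.environ.get("STORM_SUPERVISOR_LOG_FILE")
    if override:
        return override
    if topology_id:
        # storm-on-mesos would bake the topology ID into the name
        # so each tenant's supervisor logs stay separate.
        return "supervisor-%s.log" % topology_id
    return "supervisor.log"  # pre-change behavior: one shared name

os.environ["STORM_SUPERVISOR_LOG_FILE"] = "supervisor-mytopo-7.log"
print(supervisor_log_filename())  # -> supervisor-mytopo-7.log
```

With the override unset, the helper falls back to the per-topology name when an ID is supplied, and to the legacy shared `supervisor.log` otherwise, matching the multi-tenant motivation in the ticket.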
[jira] [Updated] (STORM-763) nimbus reassigned worker A to another machine, but other worker's netty client can't connect to the new worker A
[ https://issues.apache.org/jira/browse/STORM-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-763: Description: Debian 3.16.3-2~bpo70+1 (2014-09-21) x86_64 GNU/Linux java version "1.7.0_03" storm 0.9.4 cluster 50+ machines my topology has 50+ workers, yet it can't emit 5 thousand tuples in ten minutes. sometimes one worker is reassigned to another machine by nimbus because of task heartbeat timeout: {code} 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[440 440] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[90 90] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[510 510] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[160 160] not alive {code} I can see in the Storm UI that the reassigned worker has already started, but other workers write error logs all the time: {code} 2015-04-08T16:56:43.091+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.715+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.716+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.277+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.278+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 
2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.835+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable {code} The worker on the destination host has already started, and I can telnet to 192.168.163.19 5700. So why can't the Netty client connect to that ip:port? was: Debian 3.16.3-2~bpo70+1 (2014-09-21) x86_64 GNU/Linux java version "1.7.0_03" storm 0.9.4 cluster 50+ machines my topology have 50+ worker, it can't emit 5 thousand tuples in ten minutes. sometimes one worker is reassigned to another machine by nimbus because of task heartbeat timeout: 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[440 440] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[90 90] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[510 510] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[160 160] not alive i can see the reassigned worker is already started in storm UI, but other worker write error log all the time: 2015-04-08T16:56:43.091+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.715+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.716+0800 
b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.277+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.278+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.835+0800 b.s.m.n.Client
[jira] [Updated] (STORM-107) Add better ways to construct topologies
[ https://issues.apache.org/jira/browse/STORM-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-107: Description: https://github.com/nathanmarz/storm/issues/649 AFAIK the only way to construct a topology is to manually wire its components together, e.g. {code} (topology {"firehose" (spout-spec firehose-spout)} {"our-bolt-1" (bolt-spec {"firehose" :shuffle} some-bolt :p 5) "our-bolt-2" (bolt-spec {"our-bolt-1" ["word"]} some-other-bolt :p 6)}) {code} This sort of manual specification of edges seems a bit too 1990's for me. I would like a modular way to express topologies, so that you can compose sub-topologies together. Another benefit of an alternative to this graph setup is that ensuring that the topology is correct does not mean tracing every edge in the graph to make sure the graph is right. I am thinking maybe some sort of LINQ-style query that simply desugars to the arguments we pass into topology. For example, the following could desugar into the two map arguments we're passing to topology: {code} (def firehose (mk-spout "firehose" firehose-spout)) (def bolt1 (mk-bolt "our-bolt-1" some-bolt :p 5)) (def bolt2 (mk-bolt "our-bolt-2" some-other-bolt :p 6)) (from-in thing (compose firehose bolt1 bolt2) (select thing)) {code} Here from-in is pulling thing out of the result of compose'ing the firehose and the bolts, forming the topology we saw before. mk-spout would register a named spout spec, and the from macro would return the two dictionaries passed into topology. The specification needs a lot of work, but I'm willing to write the patch myself once it's nailed down. The question is, do you want me to write it and send it off to you, or am I going to have to build a storm-tools repo to distribute it? -- mrflip: We have an internal tool for describing topologies at a high level, and though it hasn't reached production we have found: 1. 
it definitely makes sense to have one set of objects that describe topologies, and a different set of objects that express them. 2. it probably makes sense to have those classes generate a static manifest: a lifeless JSON representation of a topology. To the first point, initially we did it like storm: the FooEacher class would know how to wire itself into a topology(), and also know how to Foo each record that it received. We later refactored to separate topology construction from data handling: there is an EacherStage that represents anything that obeys the Each contract, so you'd say flow do source(:kafka_trident_spout) > eacher(:foo_eacher) > so_on() > and_so_forth(). The code became simpler and more powerful. () Actually in storm stages are wired into the topology, but the issue is that they're around at run-time in both cases, requiring serialization and so forth. More importantly, it's worth considering a static manifest. The virtue of a manifest is that it is universal and static. If it's a JSON file, anything can generate it and anything can consume it; that would meet the needs of external programs which want to orchestrate Storm/Trident, as well as the repeated requests to visualize a topology in the UI. Also since it's static, the worker logic can simplify as it will know the whole graph in advance. From my experience, apart from the transactional code, the topology instantiation logic is the most complicated in the joint. That feels justifiable for the transaction logic but not for the topology instantiation. The danger of a manifest is also that it is static -- you could find yourself on the primrose path to maven-style XML hell, where you wake up one day and find you've attached layers of ponderous machinery to make a static config file Turing-complete. I think the problem comes when you try to make the file human-editable. The manifest should expressly be the porcelain result of a DSL, with all decisions baked in -- it must not be a DSL. 
In general, we find that absolute separation of orchestration (what things should be wired together) and action (actually doing things) seems painful at design time but ends up making things simpler and more powerful. was: https://github.com/nathanmarz/storm/issues/649 AFAIK the only way to construct a topology is to manually wire them together, e.g. (topology {"firehose" (spout-spec firehose-spout)} {"our-bolt-1" (bolt-spec {"firehose" :shuffle} some-bolt :p 5) "our-bolt-2" (bolt-spec {"our-bolt-1" ["word"]} some-other-bolt :p 6)}) This sort of manual specification of edges seems a bit too 1990's for me. I would like a modular way to express topologies, so that you can
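The "static manifest" idea from the discussion above can be made concrete with a small sketch: a lifeless JSON description of the example topology from the first code block. The schema shown here is entirely invented; the point is only that any tool could generate or consume such a file, independent of run-time objects.

```python
# Illustration of a hypothetical static JSON manifest for the example
# topology (firehose -> our-bolt-1 -> our-bolt-2). Schema is invented.
import json

manifest = {
    "spouts": {
        "firehose": {"class": "firehose-spout", "parallelism": 1},
    },
    "bolts": {
        "our-bolt-1": {
            "class": "some-bolt",
            "parallelism": 5,
            "inputs": {"firehose": {"grouping": "shuffle"}},
        },
        "our-bolt-2": {
            "class": "some-other-bolt",
            "parallelism": 6,
            # fields grouping on the "word" field, per the Clojure example
            "inputs": {"our-bolt-1": {"grouping": "fields",
                                      "fields": ["word"]}},
        },
    },
}

print(json.dumps(manifest, indent=2, sort_keys=True))
```

Because the manifest is plain data, it could serve both external orchestration tools and the oft-requested topology visualization in the UI, while staying "the porcelain result of a DSL" rather than becoming a DSL itself.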
[jira] [Created] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
Erik Weathers created STORM-1056: Summary: allow supervisor log filename to be configurable via ENV variable Key: STORM-1056 URL: https://issues.apache.org/jira/browse/STORM-1056 Project: Apache Storm Issue Type: Task Reporter: Erik Weathers Priority: Minor Fix For: 0.10.0, 0.11.0, 0.9.6 Requested feature: allow configuring the supervisor's log filename when launching it via an ENV variable. Motivation: The storm-on-mesos project (https://github.com/mesos/storm) relies on multiple Storm Supervisor processes per worker host, where each Supervisor is dedicated to a particular topology. This is part of the framework's functionality of separating topologies from each other. i.e., storm-on-mesos is a multi-tenant system. But before the change requested in this issue, the logs from all supervisors on a worker host will be written into a supervisor log with a single name of supervisor.log. If all logs are written to a common location on the mesos host, then all logs go to the same log file. Instead it would be desirable to separate the supervisor logs per-topology, so that each tenant/topology-owner can peruse the logs that are related to their own topology. Thus this ticket is requesting the ability to configure the supervisor log via an environment variable whilst invoking bin/storm.py (or bin/storm in pre-0.10 storm releases). When this ticket is fixed, we will include the topology ID into the supervisor log filename for storm-on-mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
[ https://issues.apache.org/jira/browse/STORM-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1056:
Description:
*Requested feature:* allow configuring the supervisor's log filename via an ENV variable when launching it.
*Motivation:* The storm-on-mesos project (https://github.com/mesos/storm) runs multiple Storm Supervisor processes per worker host, where each Supervisor is dedicated to a particular topology. This is part of the framework's functionality of separating topologies from each other; i.e., storm-on-mesos is a multi-tenant system. Before the change requested in this issue, the logs from all Supervisors on a worker host are written to a supervisor log with the single name supervisor.log: if all logs are written to a common location on the Mesos host, then all Supervisors write to the same log file. It would instead be desirable to separate the supervisor logs per topology, so that each tenant/topology-owner can peruse the logs related to their own topology. Thus this ticket requests the ability to configure the supervisor log via an environment variable when invoking bin/storm.py (or bin/storm in pre-0.10 Storm releases). Once this ticket is fixed, storm-on-mesos will include the topology ID in the supervisor log filename.
> allow supervisor log filename to be configurable via ENV variable
> -
> Key: STORM-1056
> URL: https://issues.apache.org/jira/browse/STORM-1056
> Project: Apache Storm
> Issue Type: Task
> Reporter: Erik Weathers
> Priority: Minor
> Fix For: 0.10.0, 0.11.0, 0.9.6
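The requested behavior can be sketched as follows. This is a hypothetical illustration only: the variable name STORM_SUPERVISOR_LOG_FILE and the supervisor_log_filename helper are invented for this sketch, not names from any actual patch.

```python
import os

def supervisor_log_filename(topology_id=None):
    """Pick the supervisor log filename.

    STORM_SUPERVISOR_LOG_FILE is a hypothetical ENV variable name; the
    historical fixed name "supervisor.log" remains the default when it
    is unset.
    """
    name = os.environ.get("STORM_SUPERVISOR_LOG_FILE")
    if name:
        return name
    if topology_id:
        # storm-on-mesos would derive a per-topology name like this,
        # so each tenant's supervisor logs stay separate.
        return "supervisor-{}.log".format(topology_id)
    return "supervisor.log"
```

With this shape, each storm-on-mesos Executor could export the variable before invoking bin/storm.py, without any change for clusters that never set it.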
[jira] [Updated] (STORM-1047) document internals of bin/storm.py
[ https://issues.apache.org/jira/browse/STORM-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1047:
Description:
The `python` script `bin/storm.py` is completely undocumented regarding its internals. Function comments only describe the command-line interface, often omitting an explanation of arguments and their default values (e.g. it should be clear why the default value of `klass` in `nimbus` is `"backtype.storm.daemon.nimbus"`, because that doesn't make sense to someone unfamiliar with the storm-core implementation). Also, explanations like "Launches the nimbus daemon. [...]" (again, the `nimbus` function) are fine for command-line API docs but insufficient as function documentation (they should mention that the function starts a `java` process and passes `klass` to it as the class name). How does the script use `lib/`, `extlib/` and `extlib-daemon`? It's too complex to squeeze this info out of the source code.
> document internals of bin/storm.py
> --
> Key: STORM-1047
> URL: https://issues.apache.org/jira/browse/STORM-1047
> Project: Apache Storm
> Issue Type: Documentation
> Affects Versions: 0.10.0
> Reporter: Karl Richter
> Labels: documentation
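The undocumented pattern being complained about can be sketched roughly as below. This is a simplified illustration of what such daemon commands boil down to, not the actual code of bin/storm.py; the function names and directory layout are assumptions based on the description above.

```python
import os

def build_storm_class_cmd(klass, storm_dir="/opt/storm", jvmopts=()):
    """Build the `java` invocation a daemon command ultimately runs.

    Assumed layout: Storm's own jars live under lib/, while extlib/ and
    extlib-daemon/ let users add their own jars without touching lib/.
    The real script would exec the returned command.
    """
    classpath = os.pathsep.join(
        os.path.join(storm_dir, d, "*")
        for d in ("lib", "extlib", "extlib-daemon")
    )
    return ["java", "-cp", classpath, *jvmopts, klass]

def nimbus(klass="backtype.storm.daemon.nimbus"):
    # "Launches the nimbus daemon" really means: start a JVM whose main
    # class is `klass` -- hence the otherwise-puzzling default value.
    return build_storm_class_cmd(klass)
```

Documenting even this much in the function comments would answer the two questions raised above: what `klass` is for, and how the three library directories enter the picture.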
[jira] [Reopened] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers reopened STORM-1043:
Assignee: Erik Weathers
> Concurrent access to state on local FS by multiple supervisors
> --
> Key: STORM-1043
> URL: https://issues.apache.org/jira/browse/STORM-1043
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.5
> Reporter: Ernestas Vaiciukevičius
> Assignee: Erik Weathers
> Labels: mesosphere
>
> Hi,
> we are running a storm-mesos cluster, and occasionally workers die or are "lost" in Mesos. When this happens it often coincides with errors in the logs related to the supervisors' local state.
> Looking at the Storm code, this might be caused by the way multiple supervisor processes access the local state in the same directory via VersionedStore.
> For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
> Here every supervisor does this concurrently:
> 1. reads the latest state from the FS
> 2. possibly updates the state
> 3. writes the new version of the state
> Some updates can be lost if there are 2+ supervisors and they execute the above steps concurrently: only the updates from the last supervisor remain in the last state version on disk.
> We observed local state changes quite often (on the order of seconds), so the likelihood of this concurrency issue occurring is high.
> Some examples of exceptions:
> --
> java.lang.RuntimeException: Version already exists or data already exists
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> ---
> java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
> at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
> at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
> at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
> at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
> at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> -
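The read/update/write race described in the report can be reproduced in miniature. This is an illustrative sketch, not Storm's actual VersionedStore code: it only mimics the scheme of numeric version files plus an exclusive create, to show how two supervisors sharing one state directory hit the "Version already exists" failure above.

```python
import os
import tempfile

def latest_version(state_dir):
    # Mimics VersionedStore's scheme: versions are numeric file names,
    # and the highest number is the current state.
    versions = [int(f) for f in os.listdir(state_dir) if f.isdigit()]
    return max(versions, default=0)

def create_version(state_dir, version):
    path = os.path.join(state_dir, str(version))
    # Mode "x" fails if the path already exists, analogous to
    # VersionedStore.createVersion's "Version already exists" check.
    with open(path, "x") as f:
        f.write("state")
    return path

state_dir = tempfile.mkdtemp()

# Supervisors A and B both read the latest version BEFORE either writes,
# so both compute the same next version number:
next_a = latest_version(state_dir) + 1
next_b = latest_version(state_dir) + 1

create_version(state_dir, next_a)      # A's write succeeds
try:
    create_version(state_dir, next_b)  # B collides with A
except FileExistsError:
    print("Version already exists")    # the failure mode from the stack trace
```

If B instead re-read the latest version and retried, A's update would simply be overwritten by B's next write, which is the silent lost-update case the report also mentions.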
[jira] [Comment Edited] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743763#comment-14743763 ] Erik Weathers edited comment on STORM-1043 at 9/14/15 4:18 PM:
As I replied in the issue that [~ernisv] raised (https://github.com/mesos/storm/issues/60), the solution is to leverage Mesos's ability to put each framework's data into a separate sandbox. Just *don't* set {{storm.local.dir}}, and the cwd of the Mesos Executor will be used for the Supervisor, which will be in the supervisor-specific sandbox. FYI [~revans2], the ports are taken care of automatically by Mesos's scheduler/offer system, as they are considered part of the resources that each topology claims on the Mesos worker nodes ("mesos-slave" has since been renamed "mesos-agent").
> Concurrent access to state on local FS by multiple supervisors
[jira] [Closed] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers closed STORM-1043.
Resolution: Invalid
> Concurrent access to state on local FS by multiple supervisors
[jira] [Commented] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743763#comment-14743763 ] Erik Weathers commented on STORM-1043:
As I replied in the issue that [~ernisv] raised (), the solution is to leverage mesos's ability to put framework's data into separate sandboxes. Just *don't* set {storm.local.dir} and the cwd of the Mesos Executor will be used for the Supervisor, which will be in the supervisor-specific sandbox. FYI [~revans2], the ports are taken care of automatically by Mesos's scheduler/offer system, as they are considered part of the resources that each topology is claiming on the Mesos worker nodes ("mesos-slave" has now been renamed as "mesos-agent").
> Concurrent access to state on local FS by multiple supervisors
[jira] [Updated] (STORM-188) Allow user to specify full configuration path when running storm command
[ https://issues.apache.org/jira/browse/STORM-188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-188:
Summary: Allow user to specify full configuration path when running storm command (was: Allow user to specifiy full configuration path when running storm command)
> Allow user to specify full configuration path when running storm command
> --
> Key: STORM-188
> URL: https://issues.apache.org/jira/browse/STORM-188
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Sean Zhong
> Priority: Minor
> Attachments: search_local_path_for_config.patch, storm-188.patch
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Currently, Storm only looks up the configuration path on the Java classpath. We should also allow the user to specify a full configuration path. This is very important for a shared cluster environment like YARN: multiple Storm clusters may run with different configurations but share the same binary folder.
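The requested lookup order can be sketched as below. This is a hypothetical illustration, not Storm's actual config-loading code: the find_storm_config helper and its parameters are invented names, and the directory search stands in for the real classpath lookup.

```python
import os

def find_storm_config(explicit_path=None, search_dirs=("conf",)):
    """Locate storm.yaml.

    An explicitly given full path wins, which is what lets multiple
    clusters share one binary folder while using different configs;
    otherwise fall back to searching a classpath-like list of
    directories, mimicking the current behavior.
    """
    if explicit_path:
        if os.path.isfile(explicit_path):
            return explicit_path
        raise FileNotFoundError(explicit_path)
    for d in search_dirs:
        candidate = os.path.join(d, "storm.yaml")
        if os.path.isfile(candidate):
            return candidate
    return None
```

In a YARN-style deployment, each cluster would pass its own explicit path while all of them launch from the same shared installation directory.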