[jira] [Commented] (STORM-1839) Kinesis Spout
[ https://issues.apache.org/jira/browse/STORM-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285208#comment-15285208 ] Erik Weathers commented on STORM-1839: -- Thanks [~sriharsha]! > Kinesis Spout > - > > Key: STORM-1839 > URL: https://issues.apache.org/jira/browse/STORM-1839 > Project: Apache Storm > Issue Type: Improvement > Reporter: Sriharsha Chintalapani > Assignee: Priyank Shah > > As Storm is increasingly used in cloud environments, it would be great to have a > Kinesis Spout integration in Apache Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1766) A better algorithm server rack selection for RAS
[ https://issues.apache.org/jira/browse/STORM-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1766: - Description: Currently the getBestClustering algorithm for RAS finds the "best" cluster/rack based on which rack has the most available resources. This may be insufficient, and may cause topologies not to be scheduled successfully even though there are enough resources in the cluster to schedule them. We attempt to find the rack with the most resources by finding the rack with the biggest sum of available memory + available cpu. This method is not effective since it does not consider the number of slots available. It also fails to identify racks that are unschedulable because one of the resources (memory, cpu, or slots) is exhausted. The current implementation also tries the initial scheduling on one rack only, and does not try to schedule on all the racks before giving up, which may cause topologies to fail to be scheduled because of the shortcomings described above. The current method also does not consider worker failures. When executors of a topology get unassigned and need to be scheduled again, the current logic in getBestClustering may be inadequate, if not completely wrong: when executors need to be rescheduled after a fault, getBestClustering will likely return a cluster different from the one where the majority of the topology's executors were originally scheduled. Thus, I propose a different strategy/algorithm to find the "best" cluster. I have come up with an ordering strategy I dub "subordinate resource availability ordering" (inspired by Dominant Resource Fairness) that sorts racks by the subordinate (not dominant) resource availability. 
For example, given 5 racks with the following resource availabilities:
{code}
// a rack that has a lot of memory but little cpu
rack-3 Avail [ CPU 100.0 MEM 20.0 Slots 40 ] Total [ CPU 100.0 MEM 20.0 Slots 40 ]
// supervisors that are depleted of one resource
rack-2 Avail [ CPU 0.0 MEM 8.0 Slots 40 ] Total [ CPU 0.0 MEM 8.0 Slots 40 ]
// a rack that has a lot of cpu but little memory
rack-4 Avail [ CPU 6100.0 MEM 1.0 Slots 40 ] Total [ CPU 6100.0 MEM 1.0 Slots 40 ]
// another rack of supervisors with fewer resources than rack-0
rack-1 Avail [ CPU 2000.0 MEM 4.0 Slots 40 ] Total [ CPU 2000.0 MEM 4.0 Slots 40 ]
rack-0 Avail [ CPU 4000.0 MEM 8.0 Slots 40 ] Total [ CPU 4000.0 MEM 8.0 Slots 40 ]
Cluster Overall Avail [ CPU 12200.0 MEM 41.0 Slots 200 ] Total [ CPU 12200.0 MEM 41.0 Slots 200 ]
{code}
It is clear that rack-0 is the best rack, since it is the most balanced and can potentially schedule the most executors, while rack-2 is the worst rack, since it is depleted of cpu and is thus unschedulable even though other resources are available. We first calculate the resource availability percentage of every rack for each resource by computing:
{code}
(resource available on rack) / (resource available in cluster)
{code}
We do this calculation to normalize the values; otherwise the resource values would not be comparable. 
So for our example:
{code}
rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] effective resources: 0.00819672131147541
rack-2 Avail [ CPU 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.0
rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective resources: 0.024390243902439025
rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] effective resources: 0.0975609756097561
rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.1951219512195122
{code}
The effective resource of a rack, which is also its subordinate resource, is computed as:
{code}
MIN(resource availability percentage of {CPU, Memory, # of free Slots})
{code}
Then we order the racks by effective resource. Thus for our example:
{code}
Sorted racks: [rack-0, rack-1, rack-4, rack-3, rack-2]
{code}
Also, to deal with the presence of failures: if a topology is partially scheduled, we find the rack with the most executors already scheduled for the topology and try to schedule on that rack first. Thus, when sorting racks, we first sort by the number of executors already scheduled on the rack, and then by the subordinate resource availability.
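The ordering strategy described above can be sketched in a few lines of Python (a hypothetical illustration only, not the actual RAS code; the `effective_resource` helper, the rack dictionaries, and the `scheduled_executors` map are assumptions for the example):

```python
def effective_resource(rack, cluster):
    """Subordinate resource: the minimum normalized availability across
    CPU, memory, and free slots (inspired by Dominant Resource Fairness)."""
    return min(rack[r] / cluster[r] for r in ("cpu", "mem", "slots"))

# Availabilities from the example above.
cluster = {"cpu": 12200.0, "mem": 41.0, "slots": 200}
racks = {
    "rack-0": {"cpu": 4000.0, "mem": 8.0,  "slots": 40},
    "rack-1": {"cpu": 2000.0, "mem": 4.0,  "slots": 40},
    "rack-2": {"cpu": 0.0,    "mem": 8.0,  "slots": 40},
    "rack-3": {"cpu": 100.0,  "mem": 20.0, "slots": 40},
    "rack-4": {"cpu": 6100.0, "mem": 1.0,  "slots": 40},
}

# For a partially scheduled topology, sort first by the number of executors
# already on the rack, then by subordinate resource availability (both
# descending). Here no executors are scheduled yet, so only the second key
# matters.
scheduled_executors = {name: 0 for name in racks}
ordered = sorted(
    racks,
    key=lambda name: (scheduled_executors[name],
                      effective_resource(racks[name], cluster)),
    reverse=True,
)
print(ordered)  # → ['rack-0', 'rack-1', 'rack-4', 'rack-3', 'rack-2']
```

This reproduces the sorted order given in the description: rack-2's effective resource is 0.0 because its CPU is exhausted, so it correctly lands last even though it still has memory and slots free.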
[jira] [Resolved] (STORM-143) Launching a process throws away standard out; can hang
[ https://issues.apache.org/jira/browse/STORM-143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers resolved STORM-143. - Resolution: Fixed Fix Version/s: 0.10.0 > Launching a process throws away standard out; can hang > -- > > Key: STORM-143 > URL: https://issues.apache.org/jira/browse/STORM-143 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: James Xu > Priority: Minor > Fix For: 0.10.0 > > > https://github.com/nathanmarz/storm/issues/489 > https://github.com/nathanmarz/storm/blob/master/src/clj/backtype/storm/util.clj#L349 > When we launch a process, standard out is written to a system buffer and does > not appear to be read. Also, nothing is redirected to standard in. This can > have the following effects: > A worker can hang when initializing (e.g. UnsatisfiedLinkError looking for > jzmq), and it will be unable to communicate the error as standard out is > being swallowed. > A process that writes too much to standard out will block if the buffer fills. > A process that tries to read from standard in for any reason will block. > Perhaps we can redirect standard out to an .out file, and redirect /dev/null > to the standard in stream of the process? > -- > nathanmarz: Storm redirects stdout to the logging system. It's worked fine > for us in our topologies. > -- > d2r: We see in worker.clj, in mk-worker, where there is a call to > redirect-stdio-to-slf4j!. This would not seem to help in cases such as we are > seeing when there is a problem launching the worker itself. > (defn -main [storm-id assignment-id port-str worker-id] > (let [conf1 (read-storm-config) > login_conf_file (System/getProperty "java.security.auth.login.config") > conf (if login_conf_file (merge conf1 > {"java.security.auth.login.config" login_conf_file}) conf1)] > (validate-distributed-mode! conf) > (mk-worker conf nil (java.net.URLDecoder/decode storm-id) assignment-id > (Integer/parseInt port-str) worker-id))) > If anything were to go wrong (CLASSPATH, jvm opts, misconfiguration...) > before -main or before mk-worker, then any output would be lost. The symptom > we saw was that the topology sat around apparently doing nothing, yet there > was no log indicating that the workers were failing to start. > Is there other redirection to logs that I'm missing? > -- > xiaokang: we use bash to launch the worker process and redirect its stdout to a > worker-port.out file. It helped us find the zeromq JNI problem that caused the > JVM to crash without any log. > -- > nathanmarz: @d2r Yea, that's all I was referring to. If we redirect stdout, > will the code that redirects stdout to the logging system still take effect? > This is important because we can control the size of the logfiles (via the > logback config) but not the size of the redirected stdout file. > -- > d2r: My hunch is that it will work as it does now, except that any messages > that are getting thrown away before that point would go to a file instead. I > can play with it and find out. We wouldn't want to change the redirection, > just restore visibility to any output that might occur prior to the > redirection. There should be some safety valve to control the size of any new > .out in case something goes berserk. > @xiaokang I see how that would work. We also need to make sure redirection > continues to work as it currently does for the above reason. > -- > xiaokang: @d2r @nathanmarz In our cluster, storm's stdout redirection still > works for any System.out output while JNI errors go to the worker-port.out > file. I think it will be nice to use the same worker-port.log file for bash > stdout redirection since logback can control log file size. But it is a > little bit ugly to use bash to launch the worker java process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
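The redirection suggested in the ticket above (stdout captured in an `.out` file, `/dev/null` wired to stdin so the child can never block on a read) can be sketched in Python; the command, file names, and `launch_with_out_file` helper here are illustrative, not Storm's actual launcher code:

```python
import os
import subprocess
import sys
import tempfile

def launch_with_out_file(cmd, out_path):
    """Launch a child process, capturing stdout/stderr in an .out file and
    wiring stdin to /dev/null so reads from stdin never block."""
    with open(out_path, "ab") as out, open(os.devnull, "rb") as devnull:
        return subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT,
                                stdin=devnull)

# Anything the child prints before (or instead of) setting up its own
# logging lands in the .out file rather than being swallowed.
out_file = os.path.join(tempfile.mkdtemp(), "worker-6700.out")
proc = launch_with_out_file(
    [sys.executable, "-c", "print('early startup message')"], out_file)
proc.wait()
print(open(out_file).read().strip())  # → early startup message
```

As the thread notes, a real deployment would still want a safety valve on the `.out` file's size, since (unlike the logback-managed logs) nothing rotates it.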
[jira] [Commented] (STORM-143) Launching a process throws away standard out; can hang
[ https://issues.apache.org/jira/browse/STORM-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261323#comment-15261323 ] Erik Weathers commented on STORM-143: - Aha (hadn't clicked the {{...}} on the GitHub UI)! I'll mark this ticket as closed then, thanks! > Launching a process throws away standard out; can hang > -- > > Key: STORM-143 > URL: https://issues.apache.org/jira/browse/STORM-143 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: James Xu > Priority: Minor > > [...] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1733) Logs from bin/storm are lost because stdout and stderr are not flushed
[ https://issues.apache.org/jira/browse/STORM-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261166#comment-15261166 ] Erik Weathers commented on STORM-1733: -- [gigantic auto-comment above|https://issues.apache.org/jira/browse/STORM-1733?focusedCommentId=15261156&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15261156] is an example of why I want to disable the automatic uploading of all GitHub stuff into JIRA. > Logs from bin/storm are lost because stdout and stderr are not flushed > -- > > Key: STORM-1733 > URL: https://issues.apache.org/jira/browse/STORM-1733 > Project: Apache Storm > Issue Type: Bug > Affects Versions: 0.9.3, 0.10.0, 0.9.4, 0.9.5, 0.9.6 > Reporter: Karthick Duraisamy Soundararaj > Assignee: Karthick Duraisamy Soundararaj > > bin/storm.py emits the following crucial information, which is lost because we > don't flush stdout before exec: > {code} > 2016-04-25T08:23:43.17141 Running: java -server -Dstorm.options= > -Dstorm.home= -Xmx1024m -Dlogfile.name=nimbus.log > -Dlogback.configurationFile=logback/cluster.xml backtype.storm.ui.core.nimbus > {code} > Observed Environment: > {code} > OS: CentOS release 6.5 > Kernel: 2.6.32-431.el6.x86_64 > Python version: Python 2.7.2 > {code} > For example, I am using runit to start storm components like nimbus, ui, etc., and > the problem applies to all the components: in all cases, I am > not seeing the logs that are emitted by bin/storm before {{os.execvp}} is called > to actually launch the component. > Please note that in cases where stdout and stderr are a terminal, they > are always flushed and the bug does not apply. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
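The fix this ticket describes amounts to flushing the buffered streams before `os.execvp` replaces the process image. A minimal sketch of the pattern (the `exec_storm_class` name and the command are illustrative, not the exact bin/storm.py code):

```python
import os
import sys

def exec_storm_class(command):
    """Print the command being run, then exec it. Without the flush calls,
    the buffered "Running: ..." line is lost when execvp replaces the
    process image -- e.g. when stdout is a pipe (runit) rather than a
    terminal, where it would be line-buffered and flushed automatically."""
    print("Running: " + " ".join(command))
    sys.stdout.flush()
    sys.stderr.flush()
    os.execvp(command[0], command)
```

Terminal-attached runs mask the bug because line buffering flushes on every newline; only redirected/piped stdout exhibits the loss, which matches the "Please note" caveat in the ticket.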
[jira] [Commented] (STORM-143) Launching a process throws away standard out; can hang
[ https://issues.apache.org/jira/browse/STORM-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259806#comment-15259806 ] Erik Weathers commented on STORM-143: - [~revans2]: seems this issue is fixed with the LogWriter that was introduced in storm-0.10.0. I cannot find a ticket for that feature to link this against, though. > Launching a process throws away standard out; can hang > -- > > Key: STORM-143 > URL: https://issues.apache.org/jira/browse/STORM-143 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: James Xu > Priority: Minor > > [...] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-954) Topology Event Inspector
[ https://issues.apache.org/jira/browse/STORM-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-954: Summary: Topology Event Inspector (was: Toplogy Event Inspector) > Topology Event Inspector > > > Key: STORM-954 > URL: https://issues.apache.org/jira/browse/STORM-954 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Reporter: Sriharsha Chintalapani > Assignee: Arun Mahadevan > Fix For: 1.0.0 > > > • Ability to view tuples flowing through the topology > • Ability to turn debug events on/off without having to stop/restart the topology > • Debug events are off by default > • User should be able to select a specific Spout or Bolt and see its incoming > and outgoing events > • We could make the number of events to view configurable (e.g. last 100 events > or last 1 minute) > • Tuple stream to have the following info: > • Message id, batch/transaction id, name/value pairs, timestamp, acked (boolean) > • All the above to be available from the Storm UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
[ https://issues.apache.org/jira/browse/STORM-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209115#comment-15209115 ] Erik Weathers commented on STORM-1056: -- [~kabhwan]: ahh, seems that [the release notes for storm 0.10.0|https://storm.apache.org/2015/11/05/storm0100-released.html] were just missing STORM-1056, but it's actually present in v0.10.0: * https://github.com/apache/storm/blob/v0.10.0/bin/storm.py#L80 And in the binary release tarball:
{code}
(/tmp) % wget http://www.carfab.com/apachesoftware/storm/apache-storm-0.10.0/apache-storm-0.10.0.tar.gz
...
(/tmp) % tar -xf apache-storm-0.10.0.tar.gz
(/tmp/apache-storm-0.10.0) % grep SUPERVI bin/storm.py
STORM_SUPERVISOR_LOG_FILE = os.getenv('STORM_SUPERVISOR_LOG_FILE', "supervisor.log")
"-Dlogfile.name=" + STORM_SUPERVISOR_LOG_FILE,
{code}
> allow supervisor log filename to be configurable via ENV variable > - > > Key: STORM-1056 > URL: https://issues.apache.org/jira/browse/STORM-1056 > Project: Apache Storm > Issue Type: Task > Components: storm-core > Reporter: Erik Weathers > Assignee: Erik Weathers > Priority: Minor > Fix For: 0.9.6 > > > *Requested feature:* allow configuring the supervisor's log filename when > launching it via an ENV variable. > *Motivation:* The storm-on-mesos project (https://github.com/mesos/storm) > relies on multiple Storm Supervisor processes per worker host, where each > Supervisor is dedicated to a particular topology. This is part of the > framework's functionality of separating topologies from each other. i.e., > storm-on-mesos is a multi-tenant system. But before the change requested in > this issue, the logs from all supervisors on a worker host will be written > into a supervisor log with a single name of supervisor.log. If all logs are > written to a common location on the mesos host, then all logs go to the same > log file. 
Instead it would be desirable to separate the supervisor logs > per-topology, so that each tenant/topology-owner can peruse the logs that are > related to their own topology. Thus this ticket is requesting the ability to > configure the supervisor log via an environment variable whilst invoking > bin/storm.py (or bin/storm in pre-0.10 storm releases). > When this ticket is fixed, we will include the topology ID into the > supervisor log filename for storm-on-mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
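The mechanism behind the fix is just an environment-variable lookup with a default, as the grep output quoted above shows. A small self-contained sketch of the pattern (the `supervisor_logfile_opt` helper and the topology-specific filename are illustrative):

```python
import os

def supervisor_logfile_opt():
    """Build the JVM logfile option the way the STORM-1056 change does:
    fall back to "supervisor.log" unless overridden via the environment."""
    log_file = os.getenv("STORM_SUPERVISOR_LOG_FILE", "supervisor.log")
    return "-Dlogfile.name=" + log_file

# e.g. storm-on-mesos can embed the topology ID in the filename before
# invoking bin/storm.py:
os.environ["STORM_SUPERVISOR_LOG_FILE"] = "supervisor-mytopology-1.log"
print(supervisor_logfile_opt())  # → -Dlogfile.name=supervisor-mytopology-1.log
```

With the variable unset, the helper yields the old single-name behavior (`-Dlogfile.name=supervisor.log`), so the change is backward compatible.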
[jira] [Updated] (STORM-1631) storm CGroup bugs 1) when launching workers as the user that submitted the topology 2) when initial cleanup of cgroup fails
[ https://issues.apache.org/jira/browse/STORM-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1631: - Summary: storm CGroup bugs 1) when launching workers as the user that submitted the topology 2) when initial cleanup of cgroup fails (was: torm CGroup bugs 1) when launching workers as the user that submitted the topology 2) when initial cleanup of cgroup fails) > storm CGroup bugs 1) when launching workers as the user that submitted the > topology 2) when initial cleanup of cgroup fails > --- > > Key: STORM-1631 > URL: https://issues.apache.org/jira/browse/STORM-1631 > Project: Apache Storm > Issue Type: Bug >Reporter: Boyang Jerry Peng >Assignee: Boyang Jerry Peng > > In secure multitenant storm, topology workers are launched with permission of > the user that submitted the topology. This causes a problem with cgroups > since workers are launched with permissions of the topology user which does > not have permissions to modify cgroups storm is using > Also, the clean up code is not trying to clean up cgroups of killed workers > if the initial attempt failed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1342) support multiple logviewers per host for container-isolated worker logs
[ https://issues.apache.org/jira/browse/STORM-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118650#comment-15118650 ] Erik Weathers commented on STORM-1342: -- STORM-1494 is adding support for the supervisor logs to be linked from the Nimbus UI. So this will likely be another area to adjust when (if!?) this is fixed. > support multiple logviewers per host for container-isolated worker logs > --- > > Key: STORM-1342 > URL: https://issues.apache.org/jira/browse/STORM-1342 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Reporter: Erik Weathers > Priority: Minor > > h3. Storm-on-Mesos Worker Logs are in varying directories > When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each > topology's workers are isolated into separate containers. By default the > worker logs are saved into container-specific sandbox directories. These > directories are also topology-specific by definition, because, as just > stated, the containers are specific to each topology. > h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host > A challenge with this different way of running Storm is that the [Storm > logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] > runs as a single instance on each worker host. This doesn't play well with > having the topology worker logs in separate per-topology containers. The one > logviewer doesn't know about the various sandbox directories that the Storm > Workers are writing to. And if we just spawned new logviewers for each > container, the problem is that the Storm UI only knows about one global port for > the logviewer, so you cannot simply direct users to the right per-topology logviewer. > These problems are documented (or linked to) from [Issue #6 in the > storm-on-mesos project|https://github.com/mesos/storm/issues/6] > h3. Possible Solutions I can envision > # configure the Storm workers to write to log directories that exist on the > raw host outside of the container sandbox, and run a single logviewer on each > host, which serves up the contents of that directory. > #* violates one of the basic reasons for using containers: isolation. > #* also prevents a standard use case for Mesos: running more than one > instance of a Mesos Framework (e.g., "Storm Cluster") at once on the same Mesos > Cluster, e.g. for Blue-Green deployments. > #* a variation on this proposal is to somehow expose the sandbox dirs of all > storm containers to this singleton logviewer process (still has the above > problems) > # launch a separate logviewer in each container, and somehow register those > logviewers with Storm such that Storm knows, for a given host, which logviewer > port is assigned to a given topology. > #* this is the proposed solution > h3. Storm Changes for the Proposed Solution > Nimbus or ZooKeeper could serve as a registrar, recording the association > between a slot (host + worker port) and the logviewer port that is serving > the worker's logs. And the Storm-on-Mesos framework could update this registry > when launching a new worker. (This proposal definitely calls for thorough > vetting and thinking.) > h3. Storm-on-Mesos Framework Changes for the Proposed Solution > Along with the interaction with the "registrar" proposed above, the > storm-on-mesos framework can be enhanced to launch multiple logviewers on a > given worker host, where each logviewer is dedicated to serving the worker > logs from a specific topology's container/sandbox directory. This would be > done by launching a logviewer process within the topology's container, and > assigning it an arbitrary listening port that has been determined dynamically > through mesos (which treats ports as one of the schedulable resource > primitives of a worker host). 
[Code implementing this > logviewer-port-allocation logic already > exists|https://github.com/mesos/storm/commit/af8c49beac04b530c33c1401c829caaa8e368a35], > but [that specific portion of the code was > reverted|https://github.com/mesos/storm/commit/dc3eee0f0e9c06f6da7b2fe697a8e4fc05b5227e] > because of the issues that inspired this ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
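As a rough illustration of the "registrar" idea described in the ticket above, the sketch below shows a minimal in-memory mapping from a worker slot (host + worker port) to its logviewer port. All names here are invented for illustration; a real implementation would persist the mapping in ZooKeeper or Nimbus rather than in process memory.

```python
# Hypothetical sketch of the proposed slot -> logviewer-port registry.
# Names are invented; a real version would live in ZooKeeper or Nimbus.

class LogviewerRegistry:
    def __init__(self):
        # (hostname, worker_port) -> logviewer_port
        self._slots = {}

    def register(self, host, worker_port, logviewer_port):
        # Called by the framework when it launches a worker+logviewer pair;
        # the logviewer port would come from the Mesos port resources.
        self._slots[(host, worker_port)] = logviewer_port

    def lookup(self, host, worker_port):
        # Called by the Storm UI to build a log link for a given slot;
        # returns None when no logviewer is registered for the slot.
        return self._slots.get((host, worker_port))

registry = LogviewerRegistry()
registry.register("host_19", 5700, 31452)  # 31452: a dynamically assigned port
print(registry.lookup("host_19", 5700))    # -> 31452
```

The point of the lookup-by-slot interface is that the UI already knows the host and worker port for every executor, so no other key is needed to find the right logviewer.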
[jira] [Commented] (STORM-1141) Maven Central does not have 0.10.0 libraries
[ https://issues.apache.org/jira/browse/STORM-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111524#comment-15111524 ] Erik Weathers commented on STORM-1141: -- [~cburch]: 0.10.x and 0.10.0 don't have {{ClusterSummary.nimbuses}}: * https://github.com/apache/storm/blob/v0.10.0/storm-core/src/jvm/backtype/storm/generated/ClusterSummary.java#L68-L70 * https://github.com/apache/storm/blob/0.10.x-branch/storm-core/src/jvm/backtype/storm/generated/ClusterSummary.java#L68-L70 That field [landed into master|https://github.com/apache/storm/commit/4502bffbe3f9b4cd3674a56afbda1bb115cec239] and wasn't put into 0.10.0. I believe it's part of the HA Nimbus support that is in 0.11.x. > Maven Central does not have 0.10.0 libraries > > > Key: STORM-1141 > URL: https://issues.apache.org/jira/browse/STORM-1141 > Project: Apache Storm > Issue Type: Bug >Reporter: caleb burch >Assignee: P. Taylor Goetz >Priority: Blocker > Fix For: 0.10.0 > > > HDP has moved to 2.3 that features Storm 0.10.0. The current storm-core jars > on maven central are back at 0.9.5 and the beta 0.10.0 drivers aren't up to > date. (They lack the list of nimbus nodes so fail with a > "nimbus.uptime.secs" not set error when attempting to get ClusterInfo via the > java client). > Any chance the latest 0.10.x build can be pushed to maven, or a timeframe of > when you expect to do it? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-822) As a storm developer I’d like to use the new kafka consumer API (0.8.3) to reduce dependencies and use long term supported kafka apis
[ https://issues.apache.org/jira/browse/STORM-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107726#comment-15107726 ] Erik Weathers commented on STORM-822: - [~DeepNekro]: can you please comment on whether your work directly overlaps with STORM-1015? > As a storm developer I’d like to use the new kafka consumer API (0.8.3) to > reduce dependencies and use long term supported kafka apis > -- > > Key: STORM-822 > URL: https://issues.apache.org/jira/browse/STORM-822 > Project: Apache Storm > Issue Type: Story > Components: storm-kafka >Reporter: Thomas Becker >Assignee: Hugo Louro > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1342) support multiple logviewers per host for container-isolated worker logs
[ https://issues.apache.org/jira/browse/STORM-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1342: - Description: h3. Storm-on-Mesos Worker Logs are in varying directories When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each topology's workers are isolated into separate containers. By default the worker logs will be saved into container-specific sandbox directories. These directories are also topology-specific by definition, because, as just stated, the containers are specific to each topology. h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host A challenge with this different way of running Storm is that the [Storm logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] runs as a single instance on each worker host. This doesn't play well with having the topology worker logs in separate per-topology containers. The one logviewer doesn't know about the various sandbox directories that the Storm Workers are writing to. And if we just spawned new logviewers for each container, the problem is that the Storm UI only knows about 1 global port for the logviewer, so you cannot simply direct users to the correct per-container logviewer. These problems are documented (or linked to) from [Issue #6 in the storm-on-mesos project|https://github.com/mesos/storm/issues/6] h3. Possible Solutions I can envision # configure the Storm workers to write to log directories that exist on the raw host outside of the container sandbox, and run a single logviewer on a host, which serves up the contents of that directory. #* violates one of the basic reasons for using containers: isolation. #* also prevents a standard use case for Mesos: running more than 1 instance of a Mesos Framework (e.g., "Storm Cluster") at once on the same Mesos Cluster, e.g., for Blue-Green deployments. 
#* a variation on this proposal is to somehow expose the sandbox dirs of all storm containers to this singleton logviewer process (still has the above problems) # launch a separate logviewer in each container, and somehow register those logviewers with Storm such that Storm knows, for a given host, which logviewer port is assigned to a given topology. #* this is the proposed solution h3. Storm Changes for the Proposed Solution Nimbus or ZooKeeper could serve as a registrar, recording the association between a slot (host + worker port) and the logviewer port that is serving that worker's logs. And the Storm-on-Mesos framework could update this registry when launching a new worker. (This proposal definitely calls for thorough vetting and thinking.) h3. Storm-on-Mesos Framework Changes for the Proposed Solution Along with the interaction with the "registrar" proposed above, the storm-on-mesos framework can be enhanced to launch multiple logviewers on a given worker host, where each logviewer is dedicated to serving the worker logs from a specific topology's container/sandbox directory. This would be done by launching a logviewer process within the topology's container, and assigning it an arbitrary listening port that has been determined dynamically through mesos (which treats ports as one of the schedulable resource primitives of a worker host). [Code implementing this logviewer-port-allocation logic already exists|https://github.com/mesos/storm/commit/af8c49beac04b530c33c1401c829caaa8e368a35], but [that specific portion of the code was reverted|https://github.com/mesos/storm/commit/dc3eee0f0e9c06f6da7b2fe697a8e4fc05b5227e] because of the issues that inspired this ticket. was: h3. Storm-on-Mesos Worker Logs are in varying directories When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each topology's workers are isolated into separate containers. By default the worker logs will be saved into container-specific sandbox directories. 
These directories are also topology-specific by definition, because, as just stated, the containers are specific to each topology. h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host A challenge with this different way of running Storm is that the [Storm logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] runs as a single instance on each worker host. This doesn't play well with having the topology worker logs in separate per-topology containers. The one logviewer doesn't know about the various sandbox directories that the Storm Workers are writing to. And if we just spawned new logviewers for each container, the problem is that the Storm UI only knows about 1 global port the logviewer, so you cannot just direct. h3. Possible Solutions I can envision # configure the Storm workers to write to log directories that exist on the raw host outside of the container
[jira] [Created] (STORM-1342) support multiple logviewers per host for container-isolated worker logs
Erik Weathers created STORM-1342: Summary: support multiple logviewers per host for container-isolated worker logs Key: STORM-1342 URL: https://issues.apache.org/jira/browse/STORM-1342 Project: Apache Storm Issue Type: Improvement Components: storm-core Reporter: Erik Weathers Priority: Minor h3. Storm-on-Mesos Worker Logs are in varying directories When using [storm-on-mesos|https://github.com/mesos/storm] with cgroups, each topology's workers are isolated into separate containers. By default the worker logs will be saved into container-specific sandbox directories. These directories are also topology-specific by definition, because, as just stated, the containers are specific to each topology. h3. Problem: Storm supports 1-and-only-1 Logviewer per Worker Host A challenge with this different way of running Storm is that the [Storm logviewer|https://github.com/apache/storm/blob/768a85926373355c15cc139fd86268916abc6850/docs/_posts/2013-12-08-storm090-released.md#log-viewer-ui] runs as a single instance on each worker host. This doesn't play well with having the topology worker logs in separate per-topology containers. The one logviewer doesn't know about the various sandbox directories that the Storm Workers are writing to. And if we just spawned new logviewers for each container, the problem is that the Storm UI only knows about 1 global port for the logviewer, so you cannot simply direct users to the correct per-container logviewer. h3. Possible Solutions I can envision # configure the Storm workers to write to log directories that exist on the raw host outside of the container sandbox, and run a single logviewer on a host, which serves up the contents of that directory. #* violates one of the basic reasons for using containers: isolation. #* also prevents a standard use case for Mesos: running more than 1 instance of a Mesos Framework (e.g., "Storm Cluster") at once on the same Mesos Cluster, e.g., for Blue-Green deployments. 
#* a variation on this proposal is to somehow expose the sandbox dirs of all storm containers to this singleton logviewer process (still has the above problems) # launch a separate logviewer in each container, and somehow register those logviewers with Storm such that Storm knows, for a given host, which logviewer port is assigned to a given topology. #* this is the proposed solution h3. Storm Changes for the Proposed Solution Nimbus or ZooKeeper could serve as a registrar, recording the association between a slot (host + worker port) and the logviewer port that is serving that worker's logs. And the Storm-on-Mesos framework could update this registry when launching a new worker. (This proposal definitely calls for thorough vetting and thinking.) h3. Storm-on-Mesos Framework Changes for the Proposed Solution Along with the interaction with the "registrar" proposed above, the storm-on-mesos framework can be enhanced to launch multiple logviewers on a given worker host, where each logviewer is dedicated to serving the worker logs from a specific topology's container/sandbox directory. This would be done by launching a logviewer process within the topology's container, and assigning it an arbitrary listening port that has been determined dynamically through mesos (which treats ports as one of the schedulable resource primitives of a worker host). [Code implementing this logviewer-port-allocation logic already exists|https://github.com/mesos/storm/commit/af8c49beac04b530c33c1401c829caaa8e368a35], but [that specific portion of the code was reverted|https://github.com/mesos/storm/commit/dc3eee0f0e9c06f6da7b2fe697a8e4fc05b5227e] because of the issues that inspired this ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-1216) button to kill all topologies in Storm UI
Erik Weathers created STORM-1216: Summary: button to kill all topologies in Storm UI Key: STORM-1216 URL: https://issues.apache.org/jira/browse/STORM-1216 Project: Apache Storm Issue Type: Wish Components: storm-core Affects Versions: 0.11.0 Reporter: Erik Weathers Priority: Minor In the Storm-on-Mesos project we had a [request to have an ability to "shut down the storm cluster" via a UI button|https://github.com/mesos/storm/issues/46]. That could be accomplished via a button in the Storm UI to kill all of the topologies. I understand if this is viewed as an undesirable feature, but I just wanted to document the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
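Until (and unless) such a button exists, a "kill everything" operation could be scripted around the existing `storm` CLI: list the topologies, then kill each by name. The sketch below is illustrative; the sample `storm list` output format is invented here and differs across Storm versions, so the parsing would need adjusting.

```python
# Hypothetical "kill all topologies" built on the storm CLI.
# SAMPLE_LISTING is a made-up approximation of `storm list` output.
import subprocess

SAMPLE_LISTING = """\
Topology_name        Status     Num_tasks  Num_workers  Uptime_secs
-------------------------------------------------------------------
wordcount            ACTIVE     28         3            21920
clickstream          ACTIVE     14         2            1045
"""

def topology_names(listing):
    # Skip the header row and the dashed separator; take the first column.
    lines = listing.splitlines()
    return [line.split()[0] for line in lines[2:] if line.strip()]

def kill_all(listing):
    for name in topology_names(listing):
        # -w 0 skips the usual wait period; omit it for a graceful drain.
        subprocess.run(["storm", "kill", name, "-w", "0"], check=True)

print(topology_names(SAMPLE_LISTING))  # -> ['wordcount', 'clickstream']
```

A UI button would presumably do the same thing through the Nimbus Thrift API rather than by shelling out, but the CLI loop covers the "shut down the storm cluster" request in the meantime.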
[jira] [Updated] (STORM-1027) Topology may hang because metric-tick function is a blocking call from spout
[ https://issues.apache.org/jira/browse/STORM-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1027: - Affects Version/s: 0.9.5 Fix Version/s: 0.9.6 > Topology may hang because metric-tick function is a blocking call from spout > > > Key: STORM-1027 > URL: https://issues.apache.org/jira/browse/STORM-1027 > Project: Apache Storm > Issue Type: Bug > Components: storm-core >Affects Versions: 0.10.0, 0.9.5 >Reporter: Abhishek Agarwal >Assignee: Abhishek Agarwal >Priority: Critical > Fix For: 0.10.0, 0.9.6 > > > Nathan had fixed the dining philosopher problem by putting an overflow buffer > in the spout so that the spout is not blocking. However, the overflow buffer is not > used when emitting metrics, and that could result in a deadlock. I > modified the executor to use the overflow buffer for emitting metrics, and > afterwards the topology didn't hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
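The overflow-buffer fix described above can be sketched in miniature: instead of blocking when the bounded transfer queue is full (which is what can deadlock the spout), the emitter spills into an unbounded overflow buffer that is drained opportunistically later. This is an illustration of the technique, not Storm's actual executor code.

```python
# Minimal sketch of the overflow-buffer idea: never block on emit;
# spill to an unbounded buffer when the transfer queue is full.
from collections import deque
from queue import Queue, Full

transfer_queue = Queue(maxsize=2)  # stands in for the bounded transfer queue
overflow = deque()                 # unbounded overflow buffer

def emit(tuple_):
    try:
        transfer_queue.put_nowait(tuple_)  # non-blocking: cannot deadlock
    except Full:
        overflow.append(tuple_)

def drain_overflow():
    # Move spilled tuples into the queue as space frees up.
    while overflow:
        try:
            transfer_queue.put_nowait(overflow[0])
        except Full:
            break  # still full; retry on the next tick
        overflow.popleft()

for t in ["m1", "m2", "m3", "m4"]:
    emit(t)
print(transfer_queue.qsize(), len(overflow))  # -> 2 2
```

Because metric emissions go through the same non-blocking path as regular tuples, the spout thread can always make progress, which is exactly what breaks the circular-wait condition in the dining-philosophers-style hang.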
[jira] [Commented] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
[ https://issues.apache.org/jira/browse/STORM-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994381#comment-14994381 ] Erik Weathers commented on STORM-1056: -- [~kabhwan]: seems this didn't get put into the 0.10.0 release as I had expected. :-( Can you please ensure it's in the train for 0.10.1? > allow supervisor log filename to be configurable via ENV variable > - > > Key: STORM-1056 > URL: https://issues.apache.org/jira/browse/STORM-1056 > Project: Apache Storm > Issue Type: Task > Components: storm-core >Reporter: Erik Weathers >Assignee: Erik Weathers >Priority: Minor > Fix For: 0.9.6 > > > *Requested feature:* allow configuring the supervisor's log filename when > launching it via an ENV variable. > *Motivation:* The storm-on-mesos project (https://github.com/mesos/storm) > relies on multiple Storm Supervisor processes per worker host, where each > Supervisor is dedicated to a particular topology. This is part of the > framework's functionality of separating topologies from each other. i.e., > storm-on-mesos is a multi-tenant system. But before the change requested in > this issue, the logs from all supervisors on a worker host will be written > into a supervisor log with a single name of supervisor.log. If all logs are > written to a common location on the mesos host, then all logs go to the same > log file. Instead it would be desirable to separate the supervisor logs > per-topology, so that each tenant/topology-owner can peruse the logs that are > related to their own topology. Thus this ticket is requesting the ability to > configure the supervisor log via an environment variable whilst invoking > bin/storm.py (or bin/storm in pre-0.10 storm releases). > When this ticket is fixed, we will include the topology ID into the > supervisor log filename for storm-on-mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
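The requested behavior can be sketched as a small filename-selection helper such as a launcher like bin/storm.py might use. The variable name `STORM_SUPERVISOR_LOG_FILE` and the helper itself are invented for illustration; the actual patch may use different names.

```python
# Sketch of env-var-configurable supervisor log filename.
# STORM_SUPERVISOR_LOG_FILE is a hypothetical variable name.
import os

def supervisor_log_filename(topology_id=None):
    override = os.environ.get("STORM_SUPERVISOR_LOG_FILE")
    if override:
        return override
    if topology_id:
        # storm-on-mesos would bake the topology ID into the name
        # so each tenant's supervisor logs stay separate.
        return "supervisor-%s.log" % topology_id
    return "supervisor.log"  # pre-change behavior: one shared name

os.environ["STORM_SUPERVISOR_LOG_FILE"] = "supervisor-mytopo-7.log"
print(supervisor_log_filename())  # -> supervisor-mytopo-7.log
```

With the override unset, the helper falls back to the per-topology name when an ID is supplied, and to the legacy shared `supervisor.log` otherwise, matching the multi-tenant motivation in the ticket.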
[jira] [Updated] (STORM-763) nimbus reassigned worker A to another machine, but other worker's netty client can't connect to the new worker A
[ https://issues.apache.org/jira/browse/STORM-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-763: Description: Debian 3.16.3-2~bpo70+1 (2014-09-21) x86_64 GNU/Linux java version "1.7.0_03" storm 0.9.4 cluster 50+ machines my topology has 50+ workers, yet it can't emit 5 thousand tuples in ten minutes. sometimes one worker is reassigned to another machine by nimbus because of task heartbeat timeout: {code} 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[440 440] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[90 90] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[510 510] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[160 160] not alive {code} I can see in the Storm UI that the reassigned worker has already started, but other workers write error logs all the time: {code} 2015-04-08T16:56:43.091+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.715+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.716+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.277+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.278+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 
2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.835+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable {code} The worker on the destination host has already started, and I can telnet to 192.168.163.19 5700. So why can't the Netty client connect to that ip:port? was: Debian 3.16.3-2~bpo70+1 (2014-09-21) x86_64 GNU/Linux java version "1.7.0_03" storm 0.9.4 cluster 50+ machines my topology have 50+ worker, it can't emit 5 thousand tuples in ten minutes. sometimes one worker is reassigned to another machine by nimbus because of task heartbeat timeout: 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[440 440] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[90 90] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[510 510] not alive 2015-04-08T16:51:23.026+0800 b.s.d.nimbus [INFO] Executor my_topology-22-1428243953:[160 160] not alive i can see the reassigned worker is already started in storm UI, but other worker write error log all the time: 2015-04-08T16:56:43.091+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.660+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:45.715+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:45.716+0800 
b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.277+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.278+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.306+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] connection to Netty-Client-host_19/192.168.163.19:5700 is unavailable 2015-04-08T16:56:46.586+0800 b.s.m.n.Client [ERROR] dropping 1 message(s) destined for Netty-Client-host_19/192.168.163.19:5700 2015-04-08T16:56:46.835+0800 b.s.m.n.Client
[jira] [Updated] (STORM-107) Add better ways to construct topologies
[ https://issues.apache.org/jira/browse/STORM-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-107: Description: https://github.com/nathanmarz/storm/issues/649 AFAIK the only way to construct a topology is to manually wire its components together, e.g. {code} (topology {"firehose" (spout-spec firehose-spout)} {"our-bolt-1" (bolt-spec {"firehose" :shuffle} some-bolt :p 5) "our-bolt-2" (bolt-spec {"our-bolt-1" ["word"]} some-other-bolt :p 6)}) {code} This sort of manual specification of edges seems a bit too 1990's for me. I would like a modular way to express topologies, so that you can compose sub-topologies together. Another benefit of an alternative to this graph setup is that ensuring that the topology is correct does not mean tracing every edge in the graph to make sure the graph is right. I am thinking maybe some sort of LINQ-style query that simply desugars to the arguments we pass into topology. For example, the following could desugar into the two map arguments we're passing to topology: {code} (def firehose (mk-spout "firehose" firehose-spout)) (def bolt1 (mk-bolt "our-bolt-1" some-bolt :p 5)) (def bolt2 (mk-bolt "our-bolt-2" some-other-bolt :p 6)) (from-in thing (compose firehose bolt1 bolt2) (select thing)) {code} Here from-in is pulling thing out of the result of compose'ing the firehose and the bolts, forming the topology we saw before. mk-spout would register a named spout spec, and the from macro would return the two dictionaries passed into topology. The specification needs a lot of work, but I'm willing to write the patch myself once it's nailed down. The question is, do you want me to write it and send it off to you, or am I going to have to build a storm-tools repo to distribute it? -- mrflip: We have an internal tool for describing topologies at a high level, and though it hasn't reached production we have found: 1. 
it definitely makes sense to have one set of objects that describe topologies, and a different set of objects that express them. 2. it probably makes sense to have those classes generate a static manifest: a lifeless JSON representation of a topology. To the first point, initially we did it like storm: the FooEacher class would know how to wire itself into a topology(), and also know how to Foo each record that it received. We later refactored to separate topology construction from data handling: there is an EacherStage that represents anything that obeys the Each contract, so you'd say flow do source(:kafka_trident_spout) > eacher(:foo_eacher) > so_on() > and_so_forth(). The code became simpler and more powerful. () Actually in storm stages are wired into the topology, but the issue is that they're around at run-time in both cases, requiring serialization and so forth. More importantly, it's worth considering a static manifest. The virtue of a manifest is that it is universal and static. If it's a JSON file, anything can generate it and anything can consume it; that would meet the needs of external programs which want to orchestrate Storm/Trident, as well as the repeated requests to visualize a topology in the UI. Also since it's static, the worker logic can simplify as it will know the whole graph in advance. From my experience, apart from the transactional code, the topology instantiation logic is the most complicated in the joint. That feels justifiable for the transaction logic but not for the topology instantiation. The danger of a manifest is also that it is static -- you could find yourself on the primrose path to maven-style XML hell, where you wake up one day and find you've attached layers of ponderous machinery to make a static config file Turing-complete. I think the problem comes when you try to make the file human-editable. The manifest should expressly be the porcelain result of a DSL, with all decisions baked in -- it must not be a DSL. 
In general, we find that absolute separation of orchestration (what things should be wired together) and action (actually doing things) seems painful at design time but ends up making things simpler and more powerful. was: https://github.com/nathanmarz/storm/issues/649 AFAIK the only way to construct a topology is to manually wire them together, e.g. (topology {"firehose" (spout-spec firehose-spout)} {"our-bolt-1" (bolt-spec {"firehose" :shuffle} some-bolt :p 5) "our-bolt-2" (bolt-spec {"our-bolt-1" ["word"]} some-other-bolt :p 6)}) This sort of manual specification of edges seems a bit too 1990's for me. I would like a modular way to express topologies, so that you can
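The "static manifest" idea from the discussion above can be made concrete with a small sketch: a lifeless JSON description of the example topology from the first code block. The schema shown here is entirely invented; the point is only that any tool could generate or consume such a file, independent of run-time objects.

```python
# Illustration of a hypothetical static JSON manifest for the example
# topology (firehose -> our-bolt-1 -> our-bolt-2). Schema is invented.
import json

manifest = {
    "spouts": {
        "firehose": {"class": "firehose-spout", "parallelism": 1},
    },
    "bolts": {
        "our-bolt-1": {
            "class": "some-bolt",
            "parallelism": 5,
            "inputs": {"firehose": {"grouping": "shuffle"}},
        },
        "our-bolt-2": {
            "class": "some-other-bolt",
            "parallelism": 6,
            # fields grouping on the "word" field, per the Clojure example
            "inputs": {"our-bolt-1": {"grouping": "fields",
                                      "fields": ["word"]}},
        },
    },
}

print(json.dumps(manifest, indent=2, sort_keys=True))
```

Because the manifest is plain data, it could serve both external orchestration tools and the oft-requested topology visualization in the UI, while staying "the porcelain result of a DSL" rather than becoming a DSL itself.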
[jira] [Created] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
Erik Weathers created STORM-1056: Summary: allow supervisor log filename to be configurable via ENV variable Key: STORM-1056 URL: https://issues.apache.org/jira/browse/STORM-1056 Project: Apache Storm Issue Type: Task Reporter: Erik Weathers Priority: Minor Fix For: 0.10.0, 0.11.0, 0.9.6 Requested feature: allow configuring the supervisor's log filename when launching it via an ENV variable. Motivation: The storm-on-mesos project (https://github.com/mesos/storm) relies on multiple Storm Supervisor processes per worker host, where each Supervisor is dedicated to a particular topology. This is part of the framework's functionality of separating topologies from each other. i.e., storm-on-mesos is a multi-tenant system. But before the change requested in this issue, the logs from all supervisors on a worker host will be written into a supervisor log with a single name of supervisor.log. If all logs are written to a common location on the mesos host, then all logs go to the same log file. Instead it would be desirable to separate the supervisor logs per-topology, so that each tenant/topology-owner can peruse the logs that are related to their own topology. Thus this ticket is requesting the ability to configure the supervisor log via an environment variable whilst invoking bin/storm.py (or bin/storm in pre-0.10 storm releases). When this ticket is fixed, we will include the topology ID into the supervisor log filename for storm-on-mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1056) allow supervisor log filename to be configurable via ENV variable
[ https://issues.apache.org/jira/browse/STORM-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1056:
Description:
*Requested feature:* allow configuring the supervisor's log filename via an ENV variable when launching it.
*Motivation:* The storm-on-mesos project (https://github.com/mesos/storm) runs multiple Storm Supervisor processes per worker host, where each Supervisor is dedicated to a particular topology. This is part of the framework's functionality of separating topologies from each other; i.e., storm-on-mesos is a multi-tenant system. Before the change requested in this issue, the logs from all Supervisors on a worker host are written to a supervisor log with the single name supervisor.log: if all logs are written to a common location on the Mesos host, then all Supervisors write to the same log file. It would instead be desirable to separate the supervisor logs per topology, so that each tenant/topology-owner can peruse the logs related to their own topology. Thus this ticket requests the ability to configure the supervisor log via an environment variable when invoking bin/storm.py (or bin/storm in pre-0.10 Storm releases). Once this ticket is fixed, storm-on-mesos will include the topology ID in the supervisor log filename.
> allow supervisor log filename to be configurable via ENV variable
> -
> Key: STORM-1056
> URL: https://issues.apache.org/jira/browse/STORM-1056
> Project: Apache Storm
> Issue Type: Task
> Reporter: Erik Weathers
> Priority: Minor
> Fix For: 0.10.0, 0.11.0, 0.9.6
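The requested behavior can be sketched as follows. This is a hypothetical illustration only: the variable name STORM_SUPERVISOR_LOG_FILE and the supervisor_log_filename helper are invented for this sketch, not names from any actual patch.

```python
import os

def supervisor_log_filename(topology_id=None):
    """Pick the supervisor log filename.

    STORM_SUPERVISOR_LOG_FILE is a hypothetical ENV variable name; the
    historical fixed name "supervisor.log" remains the default when it
    is unset.
    """
    name = os.environ.get("STORM_SUPERVISOR_LOG_FILE")
    if name:
        return name
    if topology_id:
        # storm-on-mesos would derive a per-topology name like this,
        # so each tenant's supervisor logs stay separate.
        return "supervisor-{}.log".format(topology_id)
    return "supervisor.log"
```

With this shape, each storm-on-mesos Executor could export the variable before invoking bin/storm.py, without any change for clusters that never set it.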
[jira] [Updated] (STORM-1047) document internals of bin/storm.py
[ https://issues.apache.org/jira/browse/STORM-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-1047:
Description:
The `python` script `bin/storm.py` is completely undocumented regarding its internals. Function comments only describe the command-line interface, often omitting an explanation of arguments and their default values (e.g. it should be clear why the default value of `klass` in `nimbus` is `"backtype.storm.daemon.nimbus"`, because that doesn't make sense to someone unfamiliar with the storm-core implementation). Also, explanations like "Launches the nimbus daemon. [...]" (again, the `nimbus` function) are fine for command-line API docs but insufficient as function documentation (they should mention that the function starts a `java` process and passes `klass` to it as the class name). How does the script use `lib/`, `extlib/` and `extlib-daemon`? It's too complex to squeeze this info out of the source code.
> document internals of bin/storm.py
> --
> Key: STORM-1047
> URL: https://issues.apache.org/jira/browse/STORM-1047
> Project: Apache Storm
> Issue Type: Documentation
> Affects Versions: 0.10.0
> Reporter: Karl Richter
> Labels: documentation
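The undocumented pattern being complained about can be sketched roughly as below. This is a simplified illustration of what such daemon commands boil down to, not the actual code of bin/storm.py; the function names and directory layout are assumptions based on the description above.

```python
import os

def build_storm_class_cmd(klass, storm_dir="/opt/storm", jvmopts=()):
    """Build the `java` invocation a daemon command ultimately runs.

    Assumed layout: Storm's own jars live under lib/, while extlib/ and
    extlib-daemon/ let users add their own jars without touching lib/.
    The real script would exec the returned command.
    """
    classpath = os.pathsep.join(
        os.path.join(storm_dir, d, "*")
        for d in ("lib", "extlib", "extlib-daemon")
    )
    return ["java", "-cp", classpath, *jvmopts, klass]

def nimbus(klass="backtype.storm.daemon.nimbus"):
    # "Launches the nimbus daemon" really means: start a JVM whose main
    # class is `klass` -- hence the otherwise-puzzling default value.
    return build_storm_class_cmd(klass)
```

Documenting even this much in the function comments would answer the two questions raised above: what `klass` is for, and how the three library directories enter the picture.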
[jira] [Reopened] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers reopened STORM-1043:
Assignee: Erik Weathers
> Concurrent access to state on local FS by multiple supervisors
> --
> Key: STORM-1043
> URL: https://issues.apache.org/jira/browse/STORM-1043
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.5
> Reporter: Ernestas Vaiciukevičius
> Assignee: Erik Weathers
> Labels: mesosphere
>
> Hi,
> we are running a storm-mesos cluster, and occasionally workers die or are "lost" in Mesos. When this happens it often coincides with errors in the logs related to the supervisors' local state.
> Looking at the Storm code, this might be caused by the way multiple supervisor processes access the local state in the same directory via VersionedStore.
> For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
> Here every supervisor does this concurrently:
> 1. reads the latest state from the FS
> 2. possibly updates the state
> 3. writes the new version of the state
> Some updates can be lost if there are 2+ supervisors and they execute the above steps concurrently: only the updates from the last supervisor remain in the last state version on disk.
> We observed local state changes quite often (on the order of seconds), so the likelihood of this concurrency issue occurring is high.
> Some examples of exceptions:
> --
> java.lang.RuntimeException: Version already exists or data already exists
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> ---
> java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
> at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
> at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
> at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
> at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
> at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> -
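The read/update/write race described in the report can be reproduced in miniature. This is an illustrative sketch, not Storm's actual VersionedStore code: it only mimics the scheme of numeric version files plus an exclusive create, to show how two supervisors sharing one state directory hit the "Version already exists" failure above.

```python
import os
import tempfile

def latest_version(state_dir):
    # Mimics VersionedStore's scheme: versions are numeric file names,
    # and the highest number is the current state.
    versions = [int(f) for f in os.listdir(state_dir) if f.isdigit()]
    return max(versions, default=0)

def create_version(state_dir, version):
    path = os.path.join(state_dir, str(version))
    # Mode "x" fails if the path already exists, analogous to
    # VersionedStore.createVersion's "Version already exists" check.
    with open(path, "x") as f:
        f.write("state")
    return path

state_dir = tempfile.mkdtemp()

# Supervisors A and B both read the latest version BEFORE either writes,
# so both compute the same next version number:
next_a = latest_version(state_dir) + 1
next_b = latest_version(state_dir) + 1

create_version(state_dir, next_a)      # A's write succeeds
try:
    create_version(state_dir, next_b)  # B collides with A
except FileExistsError:
    print("Version already exists")    # the failure mode from the stack trace
```

If B instead re-read the latest version and retried, A's update would simply be overwritten by B's next write, which is the silent lost-update case the report also mentions.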
[jira] [Comment Edited] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743763#comment-14743763 ] Erik Weathers edited comment on STORM-1043 at 9/14/15 4:18 PM:
As I replied in the issue that [~ernisv] raised (https://github.com/mesos/storm/issues/60), the solution is to leverage Mesos's ability to put each framework's data into a separate sandbox. Just *don't* set {{storm.local.dir}}, and the cwd of the Mesos Executor will be used for the Supervisor, which will be in the supervisor-specific sandbox. FYI [~revans2], the ports are taken care of automatically by Mesos's scheduler/offer system, as they are considered part of the resources that each topology claims on the Mesos worker nodes ("mesos-slave" has since been renamed "mesos-agent").
> Concurrent access to state on local FS by multiple supervisors
[jira] [Closed] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers closed STORM-1043.
Resolution: Invalid
> Concurrent access to state on local FS by multiple supervisors
[jira] [Commented] (STORM-1043) Concurrent access to state on local FS by multiple supervisors
[ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743763#comment-14743763 ] Erik Weathers commented on STORM-1043:
As I replied in the issue that [~ernisv] raised (), the solution is to leverage mesos's ability to put framework's data into separate sandboxes. Just *don't* set {storm.local.dir} and the cwd of the Mesos Executor will be used for the Supervisor, which will be in the supervisor-specific sandbox. FYI [~revans2], the ports are taken care of automatically by Mesos's scheduler/offer system, as they are considered part of the resources that each topology is claiming on the Mesos worker nodes ("mesos-slave" has now been renamed as "mesos-agent").
> Concurrent access to state on local FS by multiple supervisors
[jira] [Updated] (STORM-188) Allow user to specify full configuration path when running storm command
[ https://issues.apache.org/jira/browse/STORM-188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated STORM-188:
Summary: Allow user to specify full configuration path when running storm command (was: Allow user to specifiy full configuration path when running storm command)
> Allow user to specify full configuration path when running storm command
> --
> Key: STORM-188
> URL: https://issues.apache.org/jira/browse/STORM-188
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Sean Zhong
> Priority: Minor
> Attachments: search_local_path_for_config.patch, storm-188.patch
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Currently, Storm only looks up the configuration path on the Java classpath. We should also allow the user to specify a full configuration path. This is very important for a shared cluster environment like YARN: multiple Storm clusters may run with different configurations but share the same binary folder.
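The requested lookup order can be sketched as below. This is a hypothetical illustration, not Storm's actual config-loading code: the find_storm_config helper and its parameters are invented names, and the directory search stands in for the real classpath lookup.

```python
import os

def find_storm_config(explicit_path=None, search_dirs=("conf",)):
    """Locate storm.yaml.

    An explicitly given full path wins, which is what lets multiple
    clusters share one binary folder while using different configs;
    otherwise fall back to searching a classpath-like list of
    directories, mimicking the current behavior.
    """
    if explicit_path:
        if os.path.isfile(explicit_path):
            return explicit_path
        raise FileNotFoundError(explicit_path)
    for d in search_dirs:
        candidate = os.path.join(d, "storm.yaml")
        if os.path.isfile(candidate):
            return candidate
    return None
```

In a YARN-style deployment, each cluster would pass its own explicit path while all of them launch from the same shared installation directory.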