[jira] [Commented] (STORM-2629) Can't build site on Windows due to Nokogiri failing to install

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165544#comment-16165544
 ] 

ASF GitHub Bot commented on STORM-2629:
---

Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm-site/pull/1
  
Which OS and Ruby version combination(s) did you try?
I'm trying the change, but I'm hitting a crash in redcarpet 
(redcarpet.rb has `require 'redcarpet.so'`, but no such file exists) on macOS Sierra 
(10.12.6) with Ruby 2.4.1. The odd thing is that the standalone 'redcarpet' 
command works.


> Can't build site on Windows due to Nokogiri failing to install
> --
>
> Key: STORM-2629
> URL: https://issues.apache.org/jira/browse/STORM-2629
> Project: Apache Storm
>  Issue Type: Bug
>  Components: asf-site
>Reporter: Stig Rohde Døssing
>Assignee: Stig Rohde Døssing
>Priority: Minor
>  Labels: pull-request-available
> Attachments: STORM-2629.patch
>
>
> I'm using Windows 10's bash support, and I'm having some trouble building the 
> site since Nokogiri won't install. 
> {code}
> Running 'configure' for libxml2 2.9.2... ERROR, review
> '/tmp/bundler20170714-31-159r6j1nokogiri-1.6.7.2/gems/nokogiri-1.6.7.2/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.2/configure.log'
> to see what happened. Last lines are:
> 
> checking build system type... ./config.guess: line 4: $'\r': command not found
> ./config.guess: line 6: $'\r': command not found
> ./config.guess: line 33: $'\r': command not found
> {code}
> Upgrading Nokogiri fixes this issue, so I'd like to upgrade the gemfile to 
> the latest version of github-pages, i.e. run "bundler update". As far as I 
> can tell, we only need to make a small number of changes to get it working.
> * It seems like the meaning of the {{page}} variable in a layout has changed 
> in Jekyll. _layouts/about.html should use {{layout}} to refer to its own 
> variables instead of {{page}} (which belongs to the concrete page being 
> rendered). The other layouts don't refer to their own front matter, so there 
> shouldn't be any issue with them.
> * Jekyll has made redcarpet an optional dependency, so the gemfile should 
> list that dependency explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2629) Can't build site on Windows due to Nokogiri failing to install

2017-09-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-2629:
--
Labels: pull-request-available  (was: )

> Can't build site on Windows due to Nokogiri failing to install
> --
>
> Key: STORM-2629
> URL: https://issues.apache.org/jira/browse/STORM-2629
> Project: Apache Storm
>  Issue Type: Bug
>  Components: asf-site
>Reporter: Stig Rohde Døssing
>Assignee: Stig Rohde Døssing
>Priority: Minor
>  Labels: pull-request-available
> Attachments: STORM-2629.patch
>
>
> I'm using Windows 10's bash support, and I'm having some trouble building the 
> site since Nokogiri won't install. 
> {code}
> Running 'configure' for libxml2 2.9.2... ERROR, review
> '/tmp/bundler20170714-31-159r6j1nokogiri-1.6.7.2/gems/nokogiri-1.6.7.2/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.2/configure.log'
> to see what happened. Last lines are:
> 
> checking build system type... ./config.guess: line 4: $'\r': command not found
> ./config.guess: line 6: $'\r': command not found
> ./config.guess: line 33: $'\r': command not found
> {code}
> Upgrading Nokogiri fixes this issue, so I'd like to upgrade the gemfile to 
> the latest version of github-pages, i.e. run "bundler update". As far as I 
> can tell, we only need to make a small number of changes to get it working.
> * It seems like the meaning of the {{page}} variable in a layout has changed 
> in Jekyll. _layouts/about.html should use {{layout}} to refer to its own 
> variables instead of {{page}} (which belongs to the concrete page being 
> rendered). The other layouts don't refer to their own front matter, so there 
> shouldn't be any issue with them.
> * Jekyll has made redcarpet an optional dependency, so the gemfile should 
> list that dependency explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (STORM-2494) KafkaSpout does not handle CommitFailedException

2017-09-13 Thread Hugo Louro (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hugo Louro resolved STORM-2494.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

STORM-2640 also fixes STORM-2494

> KafkaSpout does not handle CommitFailedException
> 
>
> Key: STORM-2494
> URL: https://issues.apache.org/jira/browse/STORM-2494
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-kafka-client
>Affects Versions: 1.1.0
>Reporter: Yuri Barseghyan
>Assignee: Hugo Louro
> Fix For: 1.2.0
>
>
> In situations where tuple processing takes longer than the session timeout, we get a 
> CommitFailedException, and instead of recovering from it the Storm worker dies.
> {code}
> 2017-04-26 11:07:04.902 o.a.s.util [ERROR] Async loop died!
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
> completed since the group has already rebalanced and assigned the partitions 
> to another member. This means that the time between subsequent calls to 
> poll() was longer than the configured session.timeout.ms, which typically 
> implies that the poll loop is spending too much time message processing. You 
> can address this either by increasing the session timeout or by reducing the 
> maximum size of batches returned in poll() with max.poll.records.
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:578)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:519)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:679)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:658)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:426)
>  ~[stormjar.jar:3.0.2]
> \tat org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:278) 
> ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:404)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1058)
>  ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.storm.kafka.spout.KafkaSpout.commitOffsetsForAckedTuples(KafkaSpout.java:384)
>  ~[stormjar.jar:3.0.2]
> \tat org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:219) 
> ~[stormjar.jar:3.0.2]
> \tat 
> org.apache.storm.daemon.executor$fn__4976$fn__4991$fn__5022.invoke(executor.clj:644)
>  ~[storm-core-1.1.0.jar:1.1.0]
> \tat org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484) 
> [storm-core-1.1.0.jar:1.1.0]
> \tat clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
> \tat java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
> 2017-04-26 11:07:04.909 o.a.s.d.executor [ERROR] 
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
> completed since the group has already rebalanced and assigned the partitions 
> to another member. This means that the time between subsequent calls to 
> poll() was longer than the configured session.timeout.ms, which typically 
> implies that the poll loop is spending too much time message processing. You 
> can address this either by increasing the session timeout or by reducing the 
> maximum size of batches returned in poll() with max.poll.records.
> \tat 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:578)
>  ~[stormjar.jar:3.0.2]
> \tat 
> 
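The actual fix came in through STORM-2640. Purely as an illustration of the defensive pattern this ticket asks for, here is a minimal sketch of catching the exception instead of letting the worker die; the names kafkaConsumer, offsetsToCommit, and LOG are hypothetical, not the real spout internals:

{code}
try {
    kafkaConsumer.commitSync(offsetsToCommit);
} catch (org.apache.kafka.clients.consumer.CommitFailedException e) {
    // The group already rebalanced, so this commit is stale. Log and move on;
    // the re-assigned partitions will cause the uncommitted tuples to be re-emitted.
    LOG.warn("Offset commit failed because the group rebalanced; skipping this commit", e);
}
{code}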

[jira] [Updated] (STORM-2549) The fix for STORM-2343 is incomplete, and the spout can still get stuck on failed tuples

2017-09-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-2549:
--
Labels: pull-request-available  (was: )

> The fix for STORM-2343 is incomplete, and the spout can still get stuck on 
> failed tuples
> 
>
> Key: STORM-2549
> URL: https://issues.apache.org/jira/browse/STORM-2549
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-kafka-client
>Affects Versions: 2.0.0, 1.1.0
>Reporter: Stig Rohde Døssing
>Assignee: Stig Rohde Døssing
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Example:
> Say maxUncommittedOffsets is 10, maxPollRecords is 5, and the committedOffset 
> is 0.
> The spout will initially emit up to offset 10, because it is allowed to poll 
> until numNonRetriableTuples is >= maxUncommittedOffsets.
> The spout will be allowed to emit another 5 tuples if offset 10 fails, so if 
> that happens, offsets 10-14 will get emitted. If offset 1 fails and 2-14 get 
> acked, the spout gets stuck because it will count the "extra tuples" 11-14 in 
> numNonRetriableTuples.
> A similar case is the one where maxPollRecords doesn't divide 
> maxUncommittedOffsets evenly. If it were 3 in the example above, the spout 
> might just immediately emit offsets 1-12. If 2-12 get acked, offset 1 cannot 
> be re-emitted.
> The proposed solution is the following:
> * Enforce maxUncommittedOffsets on a per-partition basis (i.e. the effective 
> limit is multiplied by the number of partitions) by always allowing polling for 
> retriable tuples that are within maxUncommittedOffsets tuples of the 
> committed offset. Pause any non-retriable partition if it has passed the 
> maxUncommittedOffsets limit and some other partition, also at the 
> maxUncommittedOffsets limit, is polling for retries. 
> Example of this functionality:
> MaxUncommittedOffsets is 100
> MaxPollRecords is 10
> Committed offset for partition 0 and 1 is 0.
> Partition 0 has emitted 0
> Partition 1 has emitted 0...95, 97, 99, 101, 103 (some offsets compacted away)
> Partition 1, message 99 is retriable
> We check that message 99 is within 100 emitted tuples of offset 0 (it is the 
> 97th tuple after offset 0, so it is)
> We do not pause partition 0 because that partition isn't at the 
> maxUncommittedOffsets limit.
> Seek to offset 99 on partition 1 and poll
> We get back offsets 99, 101, 103 and potentially 7 new tuples. Say the lowest 
> of these is at offset 104.
> The spout emits offset 99, filters out 101 and 103 because they were already 
> emitted, and emits the 7 new tuples.
> If offset 104 (or a later offset) becomes retriable, it is not retried until 
> the committed offset moves, because offset 104 is the 101st tuple emitted 
> after offset 0 and therefore isn't allowed to retry yet.
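A minimal sketch of the per-partition check described above; the method name and parameters are illustrative, not the actual spout internals, and it counts raw offset distance, whereas the description counts emitted tuples (which matters when offsets have been compacted away):

{code}
// A retriable offset may be polled again only if it lies within
// maxUncommittedOffsets of the committed offset for its partition.
static boolean mayRetry(long retriableOffset, long committedOffset, int maxUncommittedOffsets) {
    return retriableOffset < committedOffset + maxUncommittedOffsets;
}
{code}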



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2710) Release Apache Storm 1.2.0

2017-09-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/STORM-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stig Rohde Døssing updated STORM-2710:
--
Description: This is to track remaining issues on releasing Storm 1.2.0. 
These are in addition to the issues listed in 
https://issues.apache.org/jira/browse/STORM-2709.  (was: This is to track 
remaining issues on releasing Storm 1.2.0.)

> Release Apache Storm 1.2.0
> --
>
> Key: STORM-2710
> URL: https://issues.apache.org/jira/browse/STORM-2710
> Project: Apache Storm
>  Issue Type: Epic
>Reporter: Jungtaek Lim
> Fix For: 1.2.0
>
>
> This is to track remaining issues on releasing Storm 1.2.0. These are in 
> addition to the issues listed in 
> https://issues.apache.org/jira/browse/STORM-2709.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2710) Release Apache Storm 1.2.0

2017-09-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/STORM-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stig Rohde Døssing updated STORM-2710:
--
Fix Version/s: 1.2.0

> Release Apache Storm 1.2.0
> --
>
> Key: STORM-2710
> URL: https://issues.apache.org/jira/browse/STORM-2710
> Project: Apache Storm
>  Issue Type: Epic
>Reporter: Jungtaek Lim
> Fix For: 1.2.0
>
>
> This is to track remaining issues on releasing Storm 1.2.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2709) Release Apache Storm 1.1.2

2017-09-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/STORM-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stig Rohde Døssing updated STORM-2709:
--
Fix Version/s: 1.1.2

> Release Apache Storm 1.1.2
> --
>
> Key: STORM-2709
> URL: https://issues.apache.org/jira/browse/STORM-2709
> Project: Apache Storm
>  Issue Type: Epic
>Reporter: Jungtaek Lim
> Fix For: 1.1.2
>
>
> This is to track remaining issues on releasing Storm 1.1.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (STORM-2730) Add in config options for acker cpu and memory

2017-09-13 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans resolved STORM-2730.

   Resolution: Fixed
Fix Version/s: 2.0.0

Thanks [~ethanli],

I merged this into master.

>  Add in config options for acker cpu and memory
> ---
>
> Key: STORM-2730
> URL: https://issues.apache.org/jira/browse/STORM-2730
> Project: Apache Storm
>  Issue Type: Improvement
>Reporter: Ethan Li
>Assignee: Ethan Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We want to add configurations for acker CPU and memory requirements 
> instead of just using topology.component.resources.onheap.memory.mb etc.
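A minimal sketch of what such a topology configuration could look like; the generic component keys exist today, while the acker-specific keys shown here are hypothetical placeholders, not necessarily the keys added by this change:

{code}
Config conf = new Config();
// Generic per-component defaults that already exist:
conf.put(Config.TOPOLOGY_COMPONENT_RESOURCES_ONHEAP_MEMORY_MB, 128.0);
conf.put(Config.TOPOLOGY_COMPONENT_CPU_PCORE_PERCENT, 10.0);
// Hypothetical acker-specific overrides:
conf.put("topology.acker.resources.onheap.memory.mb", 256.0);
conf.put("topology.acker.cpu.pcore.percent", 25.0);
{code}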



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2738) The number of ackers should default to the number of actual running workers on RAS cluster

2017-09-13 Thread Ethan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Li updated STORM-2738:

Summary: The number of ackers should default to the number of actual 
running workers on RAS cluster  (was: The number of ackers doesn't default to 
the number of actual running workers on RAS cluster)

> The number of ackers should default to the number of actual running workers 
> on RAS cluster
> --
>
> Key: STORM-2738
> URL: https://issues.apache.org/jira/browse/STORM-2738
> Project: Apache Storm
>  Issue Type: Bug
>Reporter: Ethan Li
>Assignee: Ethan Li
>Priority: Minor
> Attachments: Screen Shot 2017-09-13 at 11.13.41 AM.png
>
>
> I am pushing our internal code change back upstream.
> *Problem*:
> If topology.acker.executors is not set, the number of ackers will be equal 
> to topology.workers. But on a RAS cluster we don't set topology.workers, 
> because the number of workers is determined by the scheduler. So in this 
> case the number of ackers will always be 1 (see the attached screenshot).
> *Analysis*:
> The number of ackers has to be computed before scheduling happens, so the 
> scheduler knows how to schedule the topology. The number of workers is not set 
> until the topology is scheduled, so it is a bit of a chicken-and-egg problem.
> *Solution*:
> We could probably divide the total amount of memory requested when the 
> topology is submitted by the memory per worker to get an estimate that is 
> better than 1.
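Purely as an illustration of the proposed estimate (the method and parameter names are hypothetical, not the internal change being ported):

{code}
// Estimate the acker count before scheduling: total requested memory divided by
// memory per worker, never less than the current default of 1.
static int estimateNumAckers(double totalRequestedMemoryMb, double memoryPerWorkerMb) {
    return Math.max(1, (int) Math.ceil(totalRequestedMemoryMb / memoryPerWorkerMb));
}
{code}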



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (STORM-2738) The number of ackers doesn't default to the number of actual running workers on RAS cluster

2017-09-13 Thread Ethan Li (JIRA)
Ethan Li created STORM-2738:
---

 Summary: The number of ackers doesn't default to the number of 
actual running workers on RAS cluster
 Key: STORM-2738
 URL: https://issues.apache.org/jira/browse/STORM-2738
 Project: Apache Storm
  Issue Type: Bug
Reporter: Ethan Li
Assignee: Ethan Li
Priority: Minor
 Attachments: Screen Shot 2017-09-13 at 11.13.41 AM.png

I am pushing our internal code change back upstream.
*Problem*:
If topology.acker.executors is not set, the number of ackers will be equal to 
topology.workers. But on a RAS cluster we don't set topology.workers, because the 
number of workers is determined by the scheduler. So in this case, the 
number of ackers will always be 1 (see the attached screenshot).

*Analysis*:
The number of ackers has to be computed before scheduling happens, so the scheduler 
knows how to schedule the topology. The number of workers is not set until the 
topology is scheduled, so it is a bit of a chicken-and-egg problem.

*Solution*:
We could probably divide the total amount of memory requested when the topology is 
submitted by the memory per worker to get an estimate that is better 
than 1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (STORM-2736) o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with key

2017-09-13 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans resolved STORM-2736.

   Resolution: Fixed
Fix Version/s: 1.1.2
   1.2.0
   2.0.0

Thanks [~hmcc],

I merged this into master, 1.x-branch, and 1.1.x-branch. I could not pull it 
into 1.0.x-branch because there was a merge conflict and a missing dependency. If 
you really want it there, feel free to reopen this and put up another pull 
request. Keep up the good work.

> o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with key
> -
>
> Key: STORM-2736
> URL: https://issues.apache.org/jira/browse/STORM-2736
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-core
>Affects Versions: 1.1.1
>Reporter: Heather McCartney
>Assignee: Heather McCartney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.2.0, 1.1.2
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Sometimes, after our topologies have been running for a while, Zookeeper does 
> not respond within an appropriate time and we see
> {code}
> 2017-08-16 10:18:38.859 o.a.s.zookeeper [INFO] ip-10-181-20-70.ec2.internal 
> lost leadership.
> 2017-08-16 10:21:31.144 o.a.s.zookeeper [INFO] ip-10-181-20-70.ec2.internal 
> gained leadership, checking if it has all the topology code locally.
> 2017-08-16 10:21:46.201 o.a.s.zookeeper [INFO] Accepting leadership, all 
> active topology found localy.
> {code}
> That's fine, and we probably need to allocate more resources. But after a new 
> leader is chosen, we then see:
> {code}
> o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with key
> {code}
> over and over.
> I can't figure out yet how to cause the conditions that lead to Zookeeper 
> becoming unresponsive, but it is possible to reproduce the {{BlobStoreUtils}} 
> error by restarting Zookeeper.
> The problem, I think, is that the loop 
> [here|https://github.com/apache/storm/blob/v1.1.1/storm-core/src/jvm/org/apache/storm/blobstore/BlobStoreUtils.java#L175]
>  never executes because the {{nimbusInfos}} list is empty. If I add a check 
> similar to 
> [this|https://github.com/apache/storm/blob/v1.1.1/storm-core/src/jvm/org/apache/storm/blobstore/BlobStoreUtils.java#L244]
>  for a node which exists but has no children, the error goes away.
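A minimal sketch of the kind of guard described above; the method name is illustrative and the merged patch may differ:

{code}
// If the ZooKeeper node exists but has no children, there are no Nimbus replicas
// to download the blob from, so skip the update instead of logging the error forever.
static boolean shouldAttemptBlobUpdate(java.util.List<?> nimbusInfos) {
    return nimbusInfos != null && !nimbusInfos.isEmpty();
}
{code}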



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (STORM-2171) blob recovery on a single host results in deadlock

2017-09-13 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164627#comment-16164627
 ] 

Robert Joseph Evans commented on STORM-2171:


We actually saw this during some testing, but it has been a while, and it only 
happened once, so I don't have the stack trace right now.

> blob recovery on a single host results in deadlock
> --
>
> Key: STORM-2171
> URL: https://issues.apache.org/jira/browse/STORM-2171
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-core
>Affects Versions: 2.0.0
>Reporter: Robert Joseph Evans
>
> It might affect more versions, but I have only tested this on 2.x.
> Essentially, when trying to find replicas to copy blobs from, LocalFSBlobStore 
> does not exclude itself.  This results in a deadlock: it holds a 
> lock while trying to download the blob, and at the same time has made a request 
> back to itself to download the blob, which will never finish because 
> it is blocked on the same lock.
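A minimal sketch of the fix implied above; the helper and names are hypothetical, not the actual LocalFSBlobStore code:

{code}
// Exclude the local Nimbus from the replica candidates so blob recovery never
// issues a download request back to itself while it is holding the blob lock.
static java.util.List<String> remoteReplicas(java.util.List<String> replicaHosts, String localHost) {
    return replicaHosts.stream()
            .filter(host -> !host.equals(localHost))
            .collect(java.util.stream.Collectors.toList());
}
{code}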



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (STORM-2171) blob recovery on a single host results in deadlock

2017-09-13 Thread Jungtaek Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164339#comment-16164339
 ] 

Jungtaek Lim commented on STORM-2171:
-

Just curious: is this based on theory, or did you actually experience the deadlock? If 
you experienced it and have a stack dump, it would be better to attach it. And 
are you planning to fix it yourself?

> blob recovery on a single host results in deadlock
> --
>
> Key: STORM-2171
> URL: https://issues.apache.org/jira/browse/STORM-2171
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-core
>Affects Versions: 2.0.0
>Reporter: Robert Joseph Evans
>
> It might affect more versions, but I have only tested this on 2.x.
> Essentially, when trying to find replicas to copy blobs from, LocalFSBlobStore 
> does not exclude itself.  This results in a deadlock: it holds a 
> lock while trying to download the blob, and at the same time has made a request 
> back to itself to download the blob, which will never finish because 
> it is blocked on the same lock.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (STORM-2675) KafkaTridentSpoutOpaque not committing offsets to Kafka

2017-09-13 Thread Jungtaek Lim (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved STORM-2675.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Also merged into 1.x branch.

> KafkaTridentSpoutOpaque not committing offsets to Kafka
> ---
>
> Key: STORM-2675
> URL: https://issues.apache.org/jira/browse/STORM-2675
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-kafka-client
>Affects Versions: 1.1.0
>Reporter: Preet Puri
>Assignee: Stig Rohde Døssing
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.2.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Every time I restart the topology, the spout picks up the earliest message 
> even though the poll strategy is set to UNCOMMITTED_EARLIEST. I looked at Kafka's 
> __consumer_offsets topic to see if the spout (consumer) is committing the offsets, 
> but did not find any commits. I am also unable to locate the code in the 
> KafkaTridentSpoutEmitter class where the commits are updated.
> conf.put(Config.TOPOLOGY_DEBUG, true);
> conf.put(Config.TOPOLOGY_WORKERS, 1);
> conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 4); // tried with 1 as well
> conf.put(Config.TRANSACTIONAL_ZOOKEEPER_ROOT, "/aggregate");
> conf.put(Config.TRANSACTIONAL_ZOOKEEPER_SERVERS, Arrays.asList(new 
> String[]{"localhost"}));
> conf.put(Config.TRANSACTIONAL_ZOOKEEPER_PORT, 2181);
>  protected static KafkaSpoutConfig<String, String> getPMStatKafkaSpoutConfig() {
> ByTopicRecordTranslator<String, String> byTopic =
> new ByTopicRecordTranslator<>((r) -> new Values(r.topic(), r.key(), 
> r.value()),
> new Fields(TOPIC, PARTITION_KEY, PAYLOAD), SENSOR_STREAM);
> return new KafkaSpoutConfig.Builder<String, String>(Utils.getBrokerHosts(),
> StringDeserializer.class, null, Utils.getKafkaEnrichedPMSTopicName())
> .setMaxPartitionFectchBytes(10 * 1024) // 10 KB
> .setRetry(getRetryService())
> .setOffsetCommitPeriodMs(10_000)
> .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
> .setMaxUncommittedOffsets(250)
> .setProp("value.deserializer", 
> "io.confluent.kafka.serializers.KafkaAvroDeserializer")
> .setProp("schema.registry.url", "http://localhost:8081")
> .setProp("specific.avro.reader", true)
> .setGroupId(AGGREGATION_CONSUMER_GROUP)
> .setRecordTranslator(byTopic).build();
>   }
> Stream pmStatStream =
> topology.newStream("statStream", new 
> KafkaTridentSpoutOpaque<>(getPMStatKafkaSpoutConfig())).parallelismHint(1)
> storm-version - 1.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2693) Topology submission or kill takes too much time when topologies grow to a few hundred

2017-09-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-2693:
--
Labels: pull-request-available  (was: )

> Topology submission or kill takes too much time when topologies grow to a few 
> hundred
> -
>
> Key: STORM-2693
> URL: https://issues.apache.org/jira/browse/STORM-2693
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 0.9.6, 1.0.2, 1.1.0, 1.0.3
>Reporter: Yuzhao Chen
>  Labels: pull-request-available
> Attachments: 2FA30CD8-AF15-4352-992D-A67BD724E7FB.png, 
> D4A30D40-25D5-4ACF-9A96-252EBA9E6EF6.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> On a Storm cluster with 40 hosts [32 cores/128 GB memory each] and 
> hundreds of topologies, Nimbus topology submission and killing take minutes 
> to finish. For example, on a cluster with 300 topologies it takes 
> about 8 minutes to submit a topology, which seriously affects our 
> efficiency.
> So I checked the Nimbus code and found two factors that affect Nimbus 
> submission/killing time in a scheduling round:
> * reading existing assignments from ZooKeeper for every topology [about 
> 4 seconds for a 300-topology cluster]
> * reading all the worker heartbeats and updating the state in the Nimbus 
> cache [about 30 seconds for a 300-topology cluster]
> The key here is that Storm currently uses ZooKeeper to collect heartbeats 
> [not RPC], and also keeps the physical plan [assignments] in ZooKeeper, 
> although it could be kept entirely local to Nimbus.
> So I think we should make some changes to Storm's heartbeat and assignment 
> management.
> For assignment improvements:
> 1. Nimbus will keep the assignments on local disk
> 2. on restart or an HA leader change, Nimbus will recover assignments from zk 
> to local disk
> 3. Nimbus will push each supervisor its assignment through RPC every 
> scheduling round
> 4. supervisors will sync assignments at a fixed interval
> For heartbeat improvements:
> 1. workers will report executor health (ok or failed) to the supervisor at a 
> fixed interval
> 2. supervisors will report worker heartbeats to Nimbus at a fixed interval
> 3. if a supervisor dies, it will notify Nimbus through a runtime hook, 
> or Nimbus will detect whether it is still alive through supervisor awareness
> 4. the supervisor decides whether a worker is running ok or is invalid, and 
> will tell Nimbus which executors of every topology are ok (a sketch of the 
> proposed RPC flow follows below)
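A minimal sketch of the proposed RPC flow; these interfaces and types are illustrative only, not an existing Storm API:

{code}
// Assignments are pushed Nimbus -> Supervisor, and heartbeats are pushed
// Worker -> Supervisor -> Nimbus, removing ZooKeeper from the scheduling path.
final class Assignment { /* executor -> worker slot mapping for one topology */ }
final class WorkerHeartbeat { /* worker id plus per-executor ok/failed status */ }

interface SupervisorRpc {
    // Called by Nimbus every scheduling round; supervisors also re-sync on a timer.
    void syncAssignments(java.util.Map<String, Assignment> assignmentsByTopology);
}

interface NimbusRpc {
    // Called by each supervisor at a fixed interval with the heartbeats it collected.
    void reportWorkerHeartbeats(String supervisorId, java.util.List<WorkerHeartbeat> heartbeats);
}
{code}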



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (STORM-2693) Topology submission or kill takes too much time when topologies grow to a few hundred

2017-09-13 Thread Jungtaek Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164246#comment-16164246
 ] 

Jungtaek Lim commented on STORM-2693:
-

[~danny0405]
Yes, I think it should work. Then I don't think we should store the cache on disk, 
because ZK is still the source of truth, and when Nimbus restarts it must read 
from ZK rather than from disk. Caching in memory looks sufficient.

PR for Metrics V2 is available: https://github.com/apache/storm/pull/2203

> Topology submission or kill takes too much time when topologies grow to a few 
> hundred
> -
>
> Key: STORM-2693
> URL: https://issues.apache.org/jira/browse/STORM-2693
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 0.9.6, 1.0.2, 1.1.0, 1.0.3
>Reporter: Yuzhao Chen
> Attachments: 2FA30CD8-AF15-4352-992D-A67BD724E7FB.png, 
> D4A30D40-25D5-4ACF-9A96-252EBA9E6EF6.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> On a Storm cluster with 40 hosts [32 cores/128 GB memory each] and 
> hundreds of topologies, Nimbus topology submission and killing take minutes 
> to finish. For example, on a cluster with 300 topologies it takes 
> about 8 minutes to submit a topology, which seriously affects our 
> efficiency.
> So I checked the Nimbus code and found two factors that affect Nimbus 
> submission/killing time in a scheduling round:
> * reading existing assignments from ZooKeeper for every topology [about 
> 4 seconds for a 300-topology cluster]
> * reading all the worker heartbeats and updating the state in the Nimbus 
> cache [about 30 seconds for a 300-topology cluster]
> The key here is that Storm currently uses ZooKeeper to collect heartbeats 
> [not RPC], and also keeps the physical plan [assignments] in ZooKeeper, 
> although it could be kept entirely local to Nimbus.
> So I think we should make some changes to Storm's heartbeat and assignment 
> management.
> For assignment improvements:
> 1. Nimbus will keep the assignments on local disk
> 2. on restart or an HA leader change, Nimbus will recover assignments from zk 
> to local disk
> 3. Nimbus will push each supervisor its assignment through RPC every 
> scheduling round
> 4. supervisors will sync assignments at a fixed interval
> For heartbeat improvements:
> 1. workers will report executor health (ok or failed) to the supervisor at a 
> fixed interval
> 2. supervisors will report worker heartbeats to Nimbus at a fixed interval
> 3. if a supervisor dies, it will notify Nimbus through a runtime hook, 
> or Nimbus will detect whether it is still alive through supervisor awareness
> 4. the supervisor decides whether a worker is running ok or is invalid, and 
> will tell Nimbus which executors of every topology are ok



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (STORM-2675) KafkaTridentSpoutOpaque not committing offsets to Kafka

2017-09-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-2675:
--
Labels: pull-request-available  (was: )

> KafkaTridentSpoutOpaque not committing offsets to Kafka
> ---
>
> Key: STORM-2675
> URL: https://issues.apache.org/jira/browse/STORM-2675
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-kafka-client
>Affects Versions: 1.1.0
>Reporter: Preet Puri
>Assignee: Stig Rohde Døssing
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Every time I restart the topology, the spout picks up the earliest message 
> even though the poll strategy is set to UNCOMMITTED_EARLIEST. I looked at Kafka's 
> __consumer_offsets topic to see if the spout (consumer) is committing the offsets, 
> but did not find any commits. I am also unable to locate the code in the 
> KafkaTridentSpoutEmitter class where the commits are updated.
> conf.put(Config.TOPOLOGY_DEBUG, true);
> conf.put(Config.TOPOLOGY_WORKERS, 1);
> conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 4); // tried with 1 as well
> conf.put(Config.TRANSACTIONAL_ZOOKEEPER_ROOT, "/aggregate");
> conf.put(Config.TRANSACTIONAL_ZOOKEEPER_SERVERS, Arrays.asList(new 
> String[]{"localhost"}));
> conf.put(Config.TRANSACTIONAL_ZOOKEEPER_PORT, 2181);
>  protected static KafkaSpoutConfig<String, String> getPMStatKafkaSpoutConfig() {
> ByTopicRecordTranslator<String, String> byTopic =
> new ByTopicRecordTranslator<>((r) -> new Values(r.topic(), r.key(), 
> r.value()),
> new Fields(TOPIC, PARTITION_KEY, PAYLOAD), SENSOR_STREAM);
> return new KafkaSpoutConfig.Builder<String, String>(Utils.getBrokerHosts(),
> StringDeserializer.class, null, Utils.getKafkaEnrichedPMSTopicName())
> .setMaxPartitionFectchBytes(10 * 1024) // 10 KB
> .setRetry(getRetryService())
> .setOffsetCommitPeriodMs(10_000)
> .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
> .setMaxUncommittedOffsets(250)
> .setProp("value.deserializer", 
> "io.confluent.kafka.serializers.KafkaAvroDeserializer")
> .setProp("schema.registry.url", "http://localhost:8081")
> .setProp("specific.avro.reader", true)
> .setGroupId(AGGREGATION_CONSUMER_GROUP)
> .setRecordTranslator(byTopic).build();
>   }
> Stream pmStatStream =
> topology.newStream("statStream", new 
> KafkaTridentSpoutOpaque<>(getPMStatKafkaSpoutConfig())).parallelismHint(1)
> storm-version - 1.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)