Re: [E] Curator, Zookeeper and Guava woes (again)

2021-12-14 Thread Jordan Zimmerman
> It looks like someone attempted to shade Curator 2.13.0 and Curator 3.3.0, 
> but it was only done for curator-client. 

No - all of Curator is shaded except for 3 classes (that part was fixed in 
Curator 5). We have a Tech Note on this: 
https://cwiki.apache.org/confluence/display/CURATOR/TN13 - you can merely 
exclude Guava when adding a dependency on Curator (in Maven, use the 
<exclusions> tag). 
So, if you're able to upgrade to Curator 2.12.x you should have no problems.

-Jordan

> On Dec 14, 2021, at 7:56 PM, Phong X. Nguyen  wrote:
> 
> It looks like someone attempted to shade Curator 2.13.0 and Curator 3.3.0, 
> but it was only done for curator-client. 
> 
> I know they're very old releases, but is it possible that we could get a 
> Curator-2.13.1 and Curator-3.3.1 that also shades curator-framework, 
> curator-x-discovery, etc.? An official fork would be better than an internal 
> fork. 
> 
> On Tue, Dec 14, 2021 at 2:51 AM Jordan Zimmerman wrote:
>> Is there a way out of this situation without deploying our own 
>> custom-patched Curator 2 or Curator 3 in which we remove Guava dependencies? 
>> Merely upgrading to Guava 17 isn't much of a win for us (and meanwhile our 
>> automated security scanners keep yelling at us for using outdated libraries) 
> 
> If you're unable to upgrade you could create a shaded version of Curator 
> maybe with the Maven Shade plugin. This would require an internal fork for 
> you though.
> 
> I'm not familiar with CURATOR-526 and ZOOKEEPER-3577 to comment on those. 
> 
> -JZ
> 
>> On Dec 14, 2021, at 12:48 AM, Phong X. Nguyen wrote:
>> 
>> I posted about this earlier this year but I'm not sure if this was fully 
>> resolved. 
>> 
>> We're currently on Guava 14, Zookeeper 3.5.7 and Curator 2.4.2.
>> 
>> We'd like to upgrade to a reasonably recent version of Guava. That requires 
>> an upgrade to Curator 4, but that brings in the problem reported in 
>> CURATOR-526 and ZOOKEEPER-3577. Our Zookeeper configuration uses the old 
>> securedClientPort setting, which is currently unavailable in the new Dynamic 
>> Reconfiguration file format and thus is rejected by Zookeeper's new 
>> EnsembleTracker. 
>> 
>> As far as I can tell, CURATOR-526 just silences the log message, but upon 
>> trying to read through the code, it doesn't look like the internal 
>> EnsembleTracker will be properly populated as it won't read the old 
>> securedClientPort configuration. If that can't be resolved, for newer 
>> versions of Zookeeper, is there an alternative to securedClientPort?
>> 
>> Is there a way out of this situation without deploying our own 
>> custom-patched Curator 2 or Curator 3 in which we remove Guava dependencies? 
>> Merely upgrading to Guava 17 isn't much of a win for us (and meanwhile our 
>> automated security scanners keep yelling at us for using outdated libraries) 
>> 
>> Sorry to bug everyone again, but the old version of Guava that we're stuck 
>> with is starting to cause serious problems as certain libraries are 
>> requiring newer versions of Guava, and we cannot upgrade (even to fix CVEs). 
>> 
>> Thanks,
>> - Phong X. Nguyen
>> 
> 



Re: Curator, Zookeeper and Guava woes (again)

2021-12-14 Thread Jordan Zimmerman
> Is there a way out of this situation without deploying our own custom-patched 
> Curator 2 or Curator 3 in which we remove Guava dependencies? Merely 
> upgrading to Guava 17 isn't much of a win for us (and meanwhile our automated 
> security scanners keep yelling at us for using outdated libraries) 

If you're unable to upgrade you could create a shaded version of Curator maybe 
with the Maven Shade plugin. This would require an internal fork for you though.

I'm not familiar with CURATOR-526 and ZOOKEEPER-3577 to comment on those. 

-JZ

> On Dec 14, 2021, at 12:48 AM, Phong X. Nguyen  wrote:
> 
> I posted about this earlier this year but I'm not sure if this was fully 
> resolved. 
> 
> We're currently on Guava 14, Zookeeper 3.5.7 and Curator 2.4.2.
> 
> We'd like to upgrade to a reasonably recent version of Guava. That requires 
> an upgrade to Curator 4, but that brings in the problem reported in 
> CURATOR-526 and ZOOKEEPER-3577. Our Zookeeper configuration uses the old 
> securedClientPort setting, which is currently unavailable in the new Dynamic 
> Reconfiguration file format and thus is rejected by Zookeeper's new 
> EnsembleTracker. 
> 
> As far as I can tell, CURATOR-526 just silences the log message, but upon 
> trying to read through the code, it doesn't look like the internal 
> EnsembleTracker will be properly populated as it won't read the old 
> securedClientPort configuration. If that can't be resolved, for newer 
> versions of Zookeeper, is there an alternative to securedClientPort?
> 
> Is there a way out of this situation without deploying our own custom-patched 
> Curator 2 or Curator 3 in which we remove Guava dependencies? Merely 
> upgrading to Guava 17 isn't much of a win for us (and meanwhile our automated 
> security scanners keep yelling at us for using outdated libraries) 
> 
> Sorry to bug everyone again, but the old version of Guava that we're stuck 
> with is starting to cause serious problems as certain libraries are requiring 
> newer versions of Guava, and we cannot upgrade (even to fix CVEs). 
> 
> Thanks,
> - Phong X. Nguyen
> 



Re: [External Sender] Double Leadership Issue

2021-11-08 Thread Jordan Zimmerman
It would be better if these test cases were turned into unit tests so we can 
have them in the code base to prevent future problems. Can you please turn them 
into unit tests?

-Jordan

> On Nov 6, 2021, at 8:52 PM, Viswanathan Rajagopal 
>  wrote:
> 
> Have linked the test cases in my original mail.
> Test cases below,
> https://github.com/ViswaNXplore/curator/commit/0949137f7323a1d5f34afc85a7042e8d9e85a8bc
>  
> 
> https://github.com/ViswaNXplore/curator/commit/1aadd4b5dbc8811a2e7a49b92f29170333e8ba4a
>  
> 
> 



Re: [External Sender] Double Leadership Issue

2021-11-05 Thread Jordan Zimmerman
I wonder if we should add a watcher to the ZNode created by the LeaderLatch. 
That way, we can handle the case where the ZNode disappears underneath the 
latch. Other than that, it would be very helpful to have a test case for this. 
Can you try to create one please? My involvement with the project is very 
limited these days but if there's a good test case I can find some time to make 
changes.

>> Possible Solution (where we would like to hear your thoughts/suggestions):
>> 
>> *   The current Curator code during reset() does
>>     *   setLeadership(false) first, followed by
>>     *   setNode(null) (i.e. deleting its latch node)
>> 
>> *   Swapping these two should resolve the issue, as we would set leadership 
>> to false only after the latch node gets deleted
>>     *   setNode(null) (i.e. deleting its latch node) first, followed by
>>     *   setLeadership(false)

The problem is that deletion isn't guaranteed at that point. It could happen at 
a later time when there are network partitions.
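
For reference, the interleaving in the timeline from the original report (t1 
to t11) can be replayed single-threaded against an in-memory stand-in for the 
latch path. This is only a sketch of the reported sequence of steps, not 
Curator code: DoubleLeaderTimeline, LatchPath and the node naming are made up 
for illustration, and the model stops where the reported timeline stops.

```java
import java.util.TreeSet;

/** Simplified in-memory replay of the reported timeline; hypothetical, not Curator code. */
final class DoubleLeaderTimeline {

    /** Stand-in for the sorted children under the latch path. */
    static final class LatchPath {
        private final TreeSet<String> children = new TreeSet<>();
        private int sequence = 0;

        String create(String owner) {
            // Zero-padded sequence number so that string order equals creation order.
            String node = String.format("latch-%04d-%s", sequence++, owner);
            children.add(node);
            return node;
        }

        void delete(String node) {
            children.remove(node);
        }

        boolean isFirst(String node) {
            return !children.isEmpty() && children.first().equals(node);
        }
    }

    /** Replays t1..t11 in order; returns {A believes it leads, B believes it leads}. */
    static boolean[] replay() {
        LatchPath path = new LatchPath();

        // Before t1: client A is leader through its latch node.
        String nodeA = path.create("A");
        boolean leaderA = path.isFirst(nodeA);

        // t2-t5: handler of the 1st RECONNECTED resets leadership,
        // deletes the latch node and creates a new one.
        leaderA = false;
        path.delete(nodeA);
        nodeA = path.create("A");

        // t7-t8: handler of the 2nd RECONNECTED resets leadership (false -> false).
        leaderA = false;

        // t9: 1st handler gets the children, sorts them and checks leadership.
        leaderA = path.isFirst(nodeA);

        // t10: 2nd handler deletes the latch node A just became leader with.
        // The local leadership flag is not revisited in the reported timeline.
        path.delete(nodeA);

        // t11: client B creates its latch node and finds itself first.
        String nodeB = path.create("B");
        boolean leaderB = path.isFirst(nodeB);

        return new boolean[] {leaderA, leaderB};
    }

    public static void main(String[] args) {
        boolean[] result = replay();
        System.out.println("A believes it is leader: " + result[0]);
        System.out.println("B believes it is leader: " + result[1]);
    }
}
```

In this model both flags end up true, which is the state the report describes. 
It says nothing about whether a delayed delete would change the outcome.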

-Jordan

> On Nov 3, 2021, at 3:33 PM, Viswanathan Rajagopal 
>  wrote:
> 
> Hi Jordan,
> 
> The dual leadership continues indefinitely in my case.
> 
> Many Thanks,
> Viswa
> 
> From: Jordan Zimmerman 
> Date: Wednesday, 3 November 2021 at 08:02
> To: d...@curator.apache.org 
> Cc: user@curator.apache.org 
> Subject: [External Sender] Re: Double Leadership Issue
> Do I understand this correctly that there are two leaders for a short period 
> of time - i.e. it corrects itself eventually? Or does the dual leadership 
> continue indefinitely?
> 
> -Jordan
> 
>> On Nov 2, 2021, at 11:48 AM, Viswanathan Rajagopal 
>>  wrote:
>> 
>> Hello Team,
>> 
>> Greetings!
>> Any update on the below mentioned observation?
>> 
>> Many Thanks,
>> Viswa
>> 
>> From: Viswanathan Rajagopal 
>> Date: Wednesday, 27 October 2021 at 16:15
>> To: d...@curator.apache.org , 
>> user@curator.apache.org 
>> Subject: [External Sender] Double Leadership Issue
>> Hello Team,
>> 
>> Greetings!
>> While using Curator Leader Latch Recipe in our application,  we observed a 
>> potential issue where two clients have become a leader. Raised a Jira on the 
>> same for your reference (Jira link: 
>> https://issues.apache.org/jira/browse/CURATOR-620)
>> Quick summary of below description
>> 
>> *   Our use case explained
>> *   Issue details
>> *   Timeline of events mentioned
>> *   Attached test code to reproduce the reported issue
>> *   Possible solution given, where we need your suggestions
>> Our use case:
>> 
>> *   Two clients trying to get the leadership using Curator Leader Latch 
>> Recipe. On LeaderLatchListener.isLeader() Client would become a leader and 
>> on LeaderLatchListener.notLeader() Client would lose its leadership
>> Issue details:
>> 
>> *   When one of the clients received two CuratorConnectionListener 
>> RECONNECTED events in quick succession, we observed that the LeaderLatch 
>> EventThreads interleaved with each other, so that "latch node deletion" 
>> happened after "client becoming a leader"; the client therefore remains a 
>> leader even though its corresponding latch node has been deleted
>> *   The other client, which is also trying to get leadership, creates its 
>> latch node, sees itself at the first index and thus becomes a leader
>> *   So at this point, both clients have become a leader
>> 
>> Timeline of events:
>> 
>> *   Timeline events of Client A whose corresponding latch node is deleted 
>> but still be a leader
>>*   At t1, 1st RECONNECTED event fired
>>*   At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership 
>> (true -> false)
>>*   At t3, [ EventThread of 1st RECONNECTED event ] Fire 
>> “listener.notLeader()”
>>*   At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
>>*   At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
>>*   At t6, 2nd RECONNECTED event fired
>>*   At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership 
>> (false -> false), Basically NOP
>>*   At t8, [ EventThread of 2nd RECONNECTED event ] Fire nothing. 
>> Basically NOP
>>*   At t9, [ EventThread of 1st RECONNECTED event ] Get children -> sort 
>> them -> check leadership -> Set leadership t

Re: Double Leadership Issue

2021-11-03 Thread Jordan Zimmerman
Do I understand this correctly that there are two leaders for a short period of 
time - i.e. it corrects itself eventually? Or does the dual leadership continue 
indefinitely?

-Jordan

> On Nov 2, 2021, at 11:48 AM, Viswanathan Rajagopal 
>  wrote:
> 
> Hello Team,
> 
> Greetings!
> Any update on the below mentioned observation?
> 
> Many Thanks,
> Viswa
> 
> From: Viswanathan Rajagopal 
> Date: Wednesday, 27 October 2021 at 16:15
> To: d...@curator.apache.org , 
> user@curator.apache.org 
> Subject: [External Sender] Double Leadership Issue
> Hello Team,
> 
> Greetings!
> While using Curator Leader Latch Recipe in our application,  we observed a 
> potential issue where two clients have become a leader. Raised a Jira on the 
> same for your reference (Jira link: 
> https://issues.apache.org/jira/browse/CURATOR-620)
> Quick summary of below description
> 
>  *   Our use case explained
>  *   Issue details
>  *   Timeline of events mentioned
>  *   Attached test code to reproduce the reported issue
>  *   Possible solution given, where we need your suggestions
> Our use case:
> 
>  *   Two clients trying to get the leadership using Curator Leader Latch 
> Recipe. On LeaderLatchListener.isLeader() Client would become a leader and on 
> LeaderLatchListener.notLeader() Client would lose its leadership
> Issue details:
> 
>  *   When one of the clients received two CuratorConnectionListener 
> RECONNECTED events in quick succession, we observed that the LeaderLatch 
> EventThreads interleaved with each other, so that "latch node deletion" 
> happened after "client becoming a leader"; the client therefore remains a 
> leader even though its corresponding latch node has been deleted
>  *   The other client, which is also trying to get leadership, creates its 
> latch node, sees itself at the first index and thus becomes a leader
>  *   So at this point, both clients have become a leader
> 
> Timeline of events:
> 
>  *   Timeline events of Client A whose corresponding latch node is deleted 
> but still be a leader
> *   At t1, 1st RECONNECTED event fired
> *   At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership 
> (true -> false)
> *   At t3, [ EventThread of 1st RECONNECTED event ] Fire 
> “listener.notLeader()”
> *   At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
> *   At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
> *   At t6, 2nd RECONNECTED event fired
> *   At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership 
> (false -> false), Basically NOP
> *   At t8, [ EventThread of 2nd RECONNECTED event ] Fire nothing. 
> Basically NOP
> *   At t9, [ EventThread of 1st RECONNECTED event ] Get children -> sort 
> them -> check leadership -> Set leadership to true -> Fire “Has become a 
> leader” leader listener event
> *   At t10, [ EventThread of 2nd RECONNECTED event ] Delete latch node 
> (which actually deletes the latch node with which the Client A has become a 
> leader through previous step)
> 
>  *   Timeline events of Client B who also become a leader
> *   At t11, Client B creates its latch node -> Get children -> sort them 
> -> check leadership -> Set leadership to true -> Fire “Has become a leader” 
> leader listener event
> 
> This ends up in a situation where both Client A and Client B have become a 
> leader
> 
> As we observe, over the period (t8 -> t10), Client A’s LeaderLatch 
> EventThreads interleave with each other causing leadership latch node deleted 
> but local state still assumes that it’s a leader
> 
> Reproducing the issue:
> 
>  *   Wrote a Junit test case firing an artificial curator connection 
> reconnected events and simulated LeaderLatch EventThreads interleave through 
> CountDownLatches
>  *   Test simulator for 2.5.0:
> *   https://github.com/ViswaNXplore/curator/commit/6a78a3a0de032212175d80caa64f140c743219ae
> *   https://github.com/ViswaNXplore/curator/commit/d2b1b33a6885c05619c058aa2bee63962fd6fa08
>  *   Test Simulator for latest Curator version:
> *   
> 

Re: LeaderSelector Recipe Can Result In Orphaned Ephemeral Nodes And No Leader Election

2021-08-04 Thread Jordan Zimmerman
I'll look at this when I get a chance, however, there are a ton of issues 
related to ephemerals not getting deleted in the ZK Jira. I suspect this may be 
a ZK bug:

https://issues.apache.org/jira/browse/ZOOKEEPER-3018?jql=project%20%3D%20ZOOKEEPER%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20text%20~%20%22ephemeral%20not%20deleted%22

-Jordan

> On Aug 2, 2021, at 1:59 AM, H S  wrote:
> 
> I didn’t try adding the extra delete to see if it had an impact, but I 
> created some docker containers that can be used to replicate the issue if 
> that helps. Sometimes you have to run a fair number of restarts on the 
> zookeeper containers to trigger the issue, but I’ve always been able to 
> replicate it if I restart the zookeeper instances enough. You can find the 
> source for building/running the docker containers here: 
> https://github.com/hsbugreports/zookeeper_test 
> 
> Thanks.
> -hs
> 
>> On Jul 30, 2021, at 2:32 PM, Jordan Zimmerman wrote:
>> 
>> A quick thought... I wonder if we need to add a guaranteed delete right here 
>> for the node just in case: 
>> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L625
>> 
>>> On Jul 30, 2021, at 2:29 PM, H S wrote:
>>> 
>>> Hi,
>>> 
>>> Thanks for the quick response. On the call to 
>>> CuratorFrameworkFactory.newClient I am specifying RetryForever as the retry 
>>> policy if that is the retry policy you are referring to. I did notice that 
>>> I was also occasionally seeing that get interrupted (example stack trace 
>>> below) but that seemed like it was happening in a create call, not a 
>>> delete, but perhaps something similar is happening in the delete scenario 
>>> but never getting logged. I have also noticed that sometimes when it’s 
>>> happening I am not seeing that original debug stack trace, so I think there 
>>> might be a couple of possible races that are causing it to happen. Maybe 
>>> I’ll try to put together a docker compose image or something if I can that 
>>> simulates what I was doing to create the issue so that it can be 
>>> independently verified. 
>>> 
>>> Thanks.
>>> -hs
>>> 
>>> java.lang.InterruptedException: sleep interrupted
>>> at java.lang.Thread.sleep(Native Method) ~[?:?]
>>> at java.lang.Thread.sleep(Thread.java:340) ~[?:?]
>>> at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) ~[?:?]
>>> at 
>>> org.apache.curator.RetryLoopImpl.lambda$static$0(RetryLoopImpl.java:39) 
>>> ~[109:curator-client:5.2.0]
>>> at 
>>> org.apache.curator.retry.RetryForever.allowRetry(RetryForever.java:50) 
>>> [109:curator-client:5.2.0]
>>> at 
>>> org.apache.curator.RetryLoopImpl.takeException(RetryLoopImpl.java:76) 
>>> [109:curator-client:5.2.0]
>>> at 
>>> org.apache.curator.connection.ThreadLocalRetryLoop$WrappedRetryLoop.takeException(ThreadLocalRetryLoop.java:115)
>>>  [109:curator-client:5.2.0]
>>> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:99) 
>>> [109:curator-client:5.2.0]
>>> at 
>>> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
>>>  [110:curator-framework:5.2.0]
>>> at 
>>> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
>>>  [110:curator-framework:5.2.0]
>>> at 
>>> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
>>>  [110:curator-framework:5.2.0]
>>> at 
>>> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573)
>>>  [110:curator-framework:5.2.0]
>>> at 
>>> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
>>>  [110:curator-framework:5.2.0]
>>> at 
>>> org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
>>>  [111:curator-recipes:5.2.0]
>>> at 
>>> org.apache.curator.framework.recipes.locks.Lock

Re: LeaderSelector Recipe Can Result In Orphaned Ephemeral Nodes And No Leader Election

2021-07-30 Thread Jordan Zimmerman
A quick thought... I wonder if we need to add a guaranteed delete right here 
for the node just in case: 
https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L625

> On Jul 30, 2021, at 2:29 PM, H S  wrote:
> 
> Hi,
> 
> Thanks for the quick response. On the call to 
> CuratorFrameworkFactory.newClient I am specifying RetryForever as the retry 
> policy if that is the retry policy you are referring to. I did notice that I 
> was also occasionally seeing that get interrupted (example stack trace below) 
> but that seemed like it was happening in a create call, not a delete, but 
> perhaps something similar is happening in the delete scenario but never 
> getting logged. I have also noticed that sometimes when it’s happening I am 
> not seeing that original debug stack trace, so I think there might be a 
> couple of possible races that are causing it to happen. Maybe I’ll try to put 
> together a docker compose image or something if I can that simulates what I 
> was doing to create the issue so that it can be independently verified. 
> 
> Thanks.
> -hs
> 
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method) ~[?:?]
> at java.lang.Thread.sleep(Thread.java:340) ~[?:?]
> at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) ~[?:?]
> at 
> org.apache.curator.RetryLoopImpl.lambda$static$0(RetryLoopImpl.java:39) 
> ~[109:curator-client:5.2.0]
> at 
> org.apache.curator.retry.RetryForever.allowRetry(RetryForever.java:50) 
> [109:curator-client:5.2.0]
> at 
> org.apache.curator.RetryLoopImpl.takeException(RetryLoopImpl.java:76) 
> [109:curator-client:5.2.0]
> at 
> org.apache.curator.connection.ThreadLocalRetryLoop$WrappedRetryLoop.takeException(ThreadLocalRetryLoop.java:115)
>  [109:curator-client:5.2.0]
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:99) 
> [109:curator-client:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247)
>  [111:curator-recipes:5.2.0]
> at 
> org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241)
>  [111:curator-recipes:5.2.0]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:?]
> at java.lang.Thread.run(Thread.java:748) [?:?]
> 
>> On Jul 30, 2021, at 8:57 AM, Jordan Zimmerman wrote:
>> 
>> Curator has a feature called "protected 

Re: [External Sender] Double Locking Issue

2021-07-30 Thread Jordan Zimmerman
I'm now wondering if this is a ZooKeeper bug. In your original timeline you say 
that Client B sees its original ephemeral node that should have been deleted. 
Is ZooKeeper taking time to delete that ZNode? A simple fix would be to try to 
write something to that ZNode. I would hope that ZooKeeper would reject the 
write on the now expired ZNode. So, the lock recipe would be enhanced so that 
it tries to do a setData on the node before it accepts that it has the lock. 
Just a thought.

> On Jul 30, 2021, at 9:54 AM, Jordan Zimmerman  
> wrote:
> 
> Even more... actually, this should already work, right? It's been a long time 
> since I looked at this code so I wrote a quick test. wait() already exits on 
> connection loss. I apologize, but what I said below is not correct. The 
> LockInternals watcher gets called on connection loss and already wakes the 
> waiting thread via notifyAll(). So, again, sorry for the noise.
> 
> I'll go over the scenario again to see if I have more comments.
> 
> -Jordan
> 
>> On Jul 30, 2021, at 9:18 AM, Jordan Zimmerman wrote:
>> 
>> Actually... I'm looking at LockInternals and it already has a watcher set on 
>> the previous node before it calls wait(). That watcher will get called when 
>> there's a connection problem. It would be pretty easy to add something in 
>> that watcher to interrupt the waiting thread. This could be a pretty simple 
>> change actually.
>> 
>> Watcher set just before waiting: 
>> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/LockInternals.java#L301
>> The Watcher: 
>> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/LockInternals.java#L64
>> 
>> So, off the top of my head, an AtomicReference<Thread> field could be added 
>> to LockInternals. It gets set to the current thread just before waiting. If 
>> the Watcher is called with a connection event and the AtomicReference isn't 
>> empty, interrupt the thread. This will cause wait() to throw 
>> InterruptedException and then the lock node will be deleted.
>> 
>> -Jordan
>> 
>>> On Jul 30, 2021, at 9:11 AM, Jordan Zimmerman wrote:
>>> 
>>> Making lock code watch its lock node it has created
>>> 
>>> I don't think this is useful. Once the lock method returns the lock code 
>>> can't notify the caller/user that the lock node has disappeared. We'd need 
>>> some kind of additional class "lock watcher" or something that client code 
>>> would have to periodically call to ensure that it still has the lock. tbh - 
>>> this has been discussed a lot over the years on the ZooKeeper/Curator 
>>> channels. Just because you have a ZNode doesn't mean it's 100% valid. You 
>>> should still practice optimistic locks, etc. But, maybe we can reduce the 
>>> edge cases with this lock watcher or some other mechanism.
>>> 
>>> Session ID changes after wait exits it suggests the lock node should no 
>>> longer be trusted and it should be deleted and re-created
>>> 
>>> This is a simple change. But, I don't know if it would help much. It would 
>>> be better to interrupt waiting lockers when the connection is lost. If an 
>>> internal mechanism forced any locks blocked in wait() to throw an exception 
>>> then those lock threads will already delete their ZNodes and try again. 
>>> This would be the best solution maybe? So, before the lock code goes to 
>>> wait() it sets a ConnectionStateListener (or something similar) that will 
>>> interrupt the thread when there are connection problems.
>>> 
>>> Leader Latch code is well protected
>>> 
>>> Yes - the leader recipes are better at this. Maybe there's an opportunity 
>>> to merge/change code. We could consider deprecating the lock recipes in 
>>> favor of these?
>>> 
>>> -Jordan
>>> 
>>>> On Jul 29, 2021, at 2:35 AM, Viswanathan Rajagopal wrote:
>>>> 
>>>> Hi Jordan,
>>>>  
>>>> Thanks for the suggesti

Re: [External Sender] Double Locking Issue

2021-07-30 Thread Jordan Zimmerman
Even more... actually, this should already work, right? It's been a long time 
since I looked at this code so I wrote a quick test. wait() already exits on 
connection loss. I apologize, but what I said below is not correct. The 
LockInternals watcher gets called on connection loss and already wakes the 
waiting thread via notifyAll(). So, again, sorry for the noise.
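
As a rough illustration of that wake-up path, here is a guarded 
wait()/notifyAll() pair using only JDK primitives. It is a sketch, not the 
LockInternals code: the names NotifyAllWakeupSketch, onWatcherEvent and 
awaitEvent are invented, and the boolean flag stands in for whatever state the 
lock loop re-checks after it wakes up.

```java
/** Hypothetical stand-in for a watcher waking a blocked locker; not Curator code. */
final class NotifyAllWakeupSketch {
    // Guarded by this object's monitor.
    private boolean eventSeen = false;

    /** Stand-in for the watcher callback (for example on connection loss). */
    synchronized void onWatcherEvent() {
        eventSeen = true;
        notifyAll(); // wakes every thread blocked in wait() on this monitor
    }

    /** Stand-in for the blocking part of the lock loop. */
    synchronized void awaitEvent() throws InterruptedException {
        // Re-check the flag in a loop so a spurious wake-up or an
        // early notification cannot be missed or misread.
        while (!eventSeen) {
            wait();
        }
    }

    /** Returns true if the blocked thread was released by the watcher event. */
    static boolean demo() {
        NotifyAllWakeupSketch sketch = new NotifyAllWakeupSketch();
        Thread locker = new Thread(() -> {
            try {
                sketch.awaitEvent();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        locker.start();
        sketch.onWatcherEvent();
        try {
            locker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !locker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("locker released: " + demo());
    }
}
```

Note that notifyAll() only wakes the waiter; unlike an interrupt it does not 
make wait() throw, so the woken thread has to re-check its own state.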

I'll go over the scenario again to see if I have more comments.

-Jordan

> On Jul 30, 2021, at 9:18 AM, Jordan Zimmerman  
> wrote:
> 
> Actually... I'm looking at LockInternals and it already has a watcher set on 
> the previous node before it calls wait(). That watcher will get called when 
> there's a connection problem. It would be pretty easy to add something in 
> that watcher to interrupt the waiting thread. This could be a pretty simple 
> change actually.
> 
> Watcher set just before waiting: 
> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/LockInternals.java#L301
> The Watcher: 
> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/LockInternals.java#L64
> 
> So, off the top of my head, an AtomicReference<Thread> field could be added 
> to LockInternals. It gets set to the current thread just before waiting. If 
> the Watcher is called with a connection event and the AtomicReference isn't 
> empty, interrupt the thread. This will cause wait() to throw 
> InterruptedException and then the lock node will be deleted.
> 
> -Jordan
> 
>> On Jul 30, 2021, at 9:11 AM, Jordan Zimmerman wrote:
>> 
>> Making lock code watch its lock node it has created
>> 
>> I don't think this is useful. Once the lock method returns the lock code 
>> can't notify the caller/user that the lock node has disappeared. We'd need 
>> some kind of additional class "lock watcher" or something that client code 
>> would have to periodically call to ensure that it still has the lock. tbh - 
>> this has been discussed a lot over the years on the ZooKeeper/Curator 
>> channels. Just because you have a ZNode doesn't mean it's 100% valid. You 
>> should still practice optimistic locks, etc. But, maybe we can reduce the 
>> edge cases with this lock watcher or some other mechanism.
>> 
>> Session ID changes after wait exits it suggests the lock node should no 
>> longer be trusted and it should be deleted and re-created
>> 
>> This is a simple change. But, I don't know if it would help much. It would 
>> be better to interrupt waiting lockers when the connection is lost. If an 
>> internal mechanism forced any locks blocked in wait() to throw an exception 
>> then those lock threads will already delete their ZNodes and try again. This 
>> would be the best solution maybe? So, before the lock code goes to wait() it 
>> sets a ConnectionStateListener (or something similar) that will interrupt 
>> the thread when there are connection problems.
>> 
>> Leader Latch code is well protected
>> 
>> Yes - the leader recipes are better at this. Maybe there's an opportunity to 
>> merge/change code. We could consider deprecating the lock recipes in favor 
>> of these?
>> 
>> -Jordan
>> 
>>> On Jul 29, 2021, at 2:35 AM, Viswanathan Rajagopal 
>>> <viswanathan.rajag...@workday.com> wrote:
>>> 
>>> Hi Jordan,
>>>  
>>> Thanks for the suggestions. Much helpful.
>>>  
>>> Quick summary of my below comments,
>>>  *   Added test case simulator to reproduce issue (though artificially)
>>>  *   Added a proposed code just to get your review and suggestions to see 
>>> whether that would work
>>> Please find my detailed comments inline,
>>>  
>>> > I think there's generally a hole in Curator's lock recipe. The lock code 
>>> > does not watch the lock node it has created. So, another process or (as 
>>> > you found) a race with the server might cause the lock node to disappear 
>>> > underneath the lock instance after it thinks it has the lock. One thing 
>>> > we can do is to check the session ID before waiting in 
>>> > LockInternals.internalLockLoop(). If the session ID changes after wait 
>>> > exits it sug

Re: [External Sender] Double Locking Issue

2021-07-30 Thread Jordan Zimmerman
Actually... I'm looking at LockInternals and it already has a watcher set on 
the previous node before it calls wait(). That watcher will get called when 
there's a connection problem. It would be pretty easy to add something in that 
watcher to interrupt the waiting thread. This could be a pretty simple change 
actually.

Watcher set just before waiting: 
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/LockInternals.java#L301
The Watcher: 
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/LockInternals.java#L64

So, off the top of my head, an AtomicReference<Thread> field could be added to 
LockInternals. It gets set to the current thread just before waiting. If the 
Watcher is called with a connection event and the AtomicReference isn't empty, 
interrupt the thread. This will cause wait() to throw InterruptedException and 
then the lock node will be deleted.
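
As a rough illustration of the idea above (not Curator's actual code), the 
mechanism might look like the plain-JDK sketch below. The class 
InterruptibleWaiter and its methods awaitPredecessor(), nodeDeleted() and 
connectionEvent() are hypothetical names standing in for the wait loop and the 
Watcher in LockInternals; nothing here calls ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Hypothetical stand-in for the waiting part of a lock recipe; not Curator code.
final class InterruptibleWaiter {
    // Thread currently blocked in wait(), or null when nobody is waiting.
    private final AtomicReference<Thread> waiter = new AtomicReference<>();
    private boolean predecessorGone = false;

    // Blocks until nodeDeleted() is called, or throws InterruptedException
    // if connectionEvent() fires while this thread is waiting.
    synchronized void awaitPredecessor() throws InterruptedException {
        waiter.set(Thread.currentThread());
        try {
            while (!predecessorGone) {
                wait();
            }
        } finally {
            waiter.set(null);
        }
    }

    // Normal watcher path: the previous sequence node was deleted.
    synchronized void nodeDeleted() {
        predecessorGone = true;
        notifyAll();
    }

    // Watcher path for a connection event: interrupt the waiter, if there is one.
    void connectionEvent() {
        Thread blocked = waiter.get();
        if (blocked != null) {
            blocked.interrupt();
        }
    }

    // Starts a waiter, keeps signalling a connection event until it exits,
    // and reports whether it left wait() via InterruptedException.
    static boolean demo() {
        InterruptibleWaiter w = new InterruptibleWaiter();
        AtomicBoolean interrupted = new AtomicBoolean(false);
        Thread locker = new Thread(() -> {
            try {
                w.awaitPredecessor();
            } catch (InterruptedException e) {
                // A real lock recipe would delete its lock node here and retry.
                interrupted.set(true);
            }
        });
        locker.start();
        while (locker.isAlive()) {
            w.connectionEvent();
            LockSupport.parkNanos(1_000_000L);
        }
        return interrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("waiter interrupted: " + demo());
    }
}
```

A real change would also have to handle an interrupt that arrives just after 
the wait has ended normally, which this sketch ignores.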

-Jordan

> On Jul 30, 2021, at 9:11 AM, Jordan Zimmerman  
> wrote:
> 
> Making lock code watch its lock node it has created
> 
> I don't think this is useful. Once the lock method returns the lock code 
> can't notify the caller/user that the lock node has disappeared. We'd need 
> some kind of additional class "lock watcher" or something that client code 
> would have to periodically call to ensure that it still has the lock. tbh - 
> this has been discussed a lot over the years on the ZooKeeper/Curator 
> channels. Just because you have a ZNode doesn't mean it's 100% valid. You 
> should still practice optimistic locks, etc. But, maybe we can reduce the 
> edge cases with this lock watcher or some other mechanism.
> 
> Session ID changes after wait exits it suggests the lock node should no 
> longer be trusted and it should be deleted and re-created
> 
> This is a simple change. But, I don't know if it would help much. It would be 
> better to interrupt waiting lockers when the connection is lost. If an 
> internal mechanism forced any locks blocked in wait() to throw an exception 
> then those lock threads will already delete their ZNodes and try again. This 
> would be the best solution maybe? So, before the lock code goes to wait() it 
> sets a ConnectionStateListener (or something similar) that will interrupt the 
> thread when there are connection problems.
> 
> Leader Latch code is well protected
> 
> Yes - the leader recipes are better at this. Maybe there's an opportunity to 
> merge/change code. We could consider deprecating the lock recipes in favor of 
> these?
> 
> -Jordan
> 
>> On Jul 29, 2021, at 2:35 AM, Viswanathan Rajagopal 
>> <viswanathan.rajag...@workday.com> wrote:
>> 
>> Hi Jordan,
>>  
>> Thanks for the suggestions. Much helpful.
>>  
>> Quick summary of my below comments,
>>  *   Added test case simulator to reproduce issue (though artificially)
>>  *   Added a proposed code just to get your review and suggestions to see 
>> whether that would work
>> Please find my detailed comments inline,
>>  
>> > I think there's generally a hole in Curator's lock recipe. The lock code 
>> > does not watch the lock node it has created. So, another process or (as 
>> > you found) a race with the server might cause the lock node to disappear 
>> > underneath the lock instance after it thinks it has the lock. One thing we 
>> > can do is to check the session ID before waiting in 
>> > LockInternals.internalLockLoop(). If the session ID changes after wait 
>> > exits it suggests the lock node should no longer be trusted and it should 
>> > be deleted and re-created. That's 1 though anyway.
>>  
>> Yes, you are right. Currently lock code doesn’t watch the lock node it has 
>> created. That’s where we would like to hear your feedback on our proposal 
>> mentioned in our first mail thread 
>> “Curator lock code has makeRevocable() API that enables application to 
>> revoke lock anytime from application by triggering NODE_CHANGE event through 
>> Revoker.attemptToRevoke() utility. 
>> Proposal: Would it be nice to extend makeRevocable() API to handle Node 
>> delete event, which would allow application to register the watcher for Node 
>> delete event, thereby application can react to Node delete event by revoking 
>> the lock ?”. Tried the Code snippet 
>> (https://gith

Re: [External Sender] Double Locking Issue

2021-07-30 Thread Jordan Zimmerman
> zkClient had reconnected and it still happens to see the ephemeral node just 
> before the server deletes it since its session has expired, but the node is 
> deleted afterwards by the server.
> Approach #2 (A little manual interruption needed to reproduce the issue)
>   1) Run the test case in Debug mode (SIMULATE_ISSUE_BY_EXPLICIT_DELETE set 
> to false)
>   2) Artificially delaying / pausing the ephemeral lock nodes deletion as 
> part of session cleanup process in server code (ZookeeperServer.close() 
> method)
>   3) After a pause (say 5s) to make one of the instance to acquire the lock, 
> Artificially break the socket connection between client and server for 30s 
> (by keeping breakpoint in ClientCnxnSocketNIO.doIO() method). After 30s, we 
> would see session closing logs logged in server code
>   4) After 1min, remove breakpoint in ClientCnxnSocketNIO.doIO() and resume 
> both Thread 2 and Thread 3
>   5) After that, resume server thread (thread name would be “SessionTracker”)
> 
> Below Proposals discussed so far (3rd added now for your review)
>   1) Making lock code watch its lock node it has created
>   2) Session ID changes after wait exits it suggests the lock node should no 
> longer be trusted and it should be deleted and re-created
>   3) Leader Latch code is well protected to cover this zookeeper race 
> condition, because Leader Latch code internally handle the connection events 
> (which they use to interrupt latch_acquire_state to reset its latch every 
> time connection is reconnected), means it will explicitly release the latch 
> and recreate new if there is a connection disconnect (may be this can be the 
> approach that lock recipe could use to protect ?) 
> Many Thanks,
> Viswa.
>  
> From: Jordan Zimmerman 
> Date: Tuesday, 20 July 2021 at 21:11
> To: Viswanathan Rajagopal 
> Cc: Sean Gibbons , d...@curator.apache.org 
> , cammcken...@apache.org , 
> user@curator.apache.org , Donal Arundel 
> , Francisco Goncalves 
> , Zak Szadowski , 
> Dandie Beloy , Marcelo Cenerino 
> 
> Subject: Re: [External Sender] Double Locking Issue
> 
> In our case, it wouldn’t throw any exception because it had gone past 
> “creating lock nodes” and was blocked on wait(), which would only then be 
> interrupted when curator watcher notified on previous sequence node delete 
> event.
>  
> So, you're using the version of acquire() without a timeout? In any event, 
> this is a problem. When you receive SUSPENDED you really should interrupt any 
> threads that are waiting on Curator. The Curator docs imply this even though 
> it might not be obvious. This is likely the source of your problems. A simple 
> solution is to use the version of acquire that has a timeout and repeatedly 
> call it until success (though your problem may still occur in this case). 
> Maybe we could improve the lock recipe (or something similar) so that locks 
> inside of acquire are interrupted on a network partition. 
>  
> [snip of your remaining thoughtful analysis]
>  
> I think there's generally a hole in Curator's lock recipe. The lock code does 
> not watch the lock node it has created. So, another process or (as you found) 
> a race with the server might cause the lock node to disappear underneath the 
> lock instance after it thinks it has the lock. 
>  
> One thing we can do is to check the session ID before waiting in 
> LockInternals.internalLockLoop(). If the session ID changes after wait exits 
> it suggests the lock node should no longer be trusted and it should be 
> deleted and re-created. That's 1 though anyway. It would be nice to generate 
> a test/simulation for this case so that it can be properly dealt with.
>  
> -Jordan
> 
> 
> On Jul 20, 2021, at 3:53 PM, Viswanathan Rajagopal 
> <viswanathan.rajag...@workday.com> wrote:
>  
> Thanks Jordan for coming back on this.
>  
> Please find my inline comments. Also provided below few additional version 
> info that we are using,
> Curator version – 2.5.0
> Zookeeper version – 3.5.3
>  
> >  I don't see how this is the case. If the client has received a network 
> > partition it shouldn't consider any locks as being currently held (see 
> > the Tech Notes I mentioned in my previous email). If there is a partition 
> > during a call to acquire(), acquire() would throw an exception (once the 
> > retry policy has expired). BTW, what retry policy are you using?
>  
> In our case, it wouldn’t throw any exception because it had gone past 
> “creating lock nodes” and was blocked on wait(), which would only then be 
> interrupted when curator watcher notified on previous sequence node delete 
> event.
>  
> > This isn't correct. Curator watches the previous node but lock acquisition 
> > is always b

Re: LeaderSelector Recipe Can Result In Orphaned Ephemeral Nodes And No Leader Election

2021-07-30 Thread Jordan Zimmerman
Curator has a feature called "protected mode". It adds a UUID to the node name 
and, when there is a connection issue or similar failure, it tries to find the 
node it created (see here: 
https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L601).
I wonder why this mechanism is getting defeated. It would be nice to get a 
test simulation that reproduces this. It's possible that the retry policy is 
expiring and FindAndDeleteProtectedNodeInBackground is giving up and rethrowing 
the exception. What is your retry policy?

-Jordan

> On Jul 30, 2021, at 12:29 AM, H S  wrote:
> 
> Hi,
> 
> While using the LeaderSelector recipe I noticed what appears to be an issue 
> where under some circumstances during zookeeper failover or network issues, 
> orphaned ephemeral nodes are created resulting in no leader election for a 
> cluster. I have reproduced this issue in versions 5.1.0 and 5.2.0.
> 
> What appears to be happening is the following:
>   1) Node A is attempting to acquire the interprocess lock.
>   2) It attempts to create its ephemeral node by calling 
> StandardLockInternalsDriver.createsTheLock
>   3) The zookeeper client issues the request to the zookeeper server
>   4) The zookeeper server creates the ephemeral node
>   5) While the response is being returned from the server to the zookeeper 
> client, the channel is broken, resulting in an EndOfStreamException.
>   6) This results in an unhandled ConnectionLossException propagating all the 
> way up the LeaderSelector.internalRequeue call stack, killing the submitted 
> task without deleting the created ephemeral node
>   7) The zookeeper session for the client is still valid, resulting in the 
> ephemeral node remaining orphaned indefinitely.
>   8) During all subsequent requeue attempts the orphaned node is a 
> predecessor of all nodes and treated as if it is the leader, however, it's 
> not running because it errored out before calling the selector listener.
>   9) Currently the only way to resolve the issue appears to be to check the 
> number of participants around any failover occurrences and if more 
> participants are listed than nodes, the framework session associated with the 
> extra participants must be closed to invalidate its session and delete the 
> orphaned node.
> 
> I have recreated the issue by repeatedly restarting the nodes in a zookeeper 
> cluster to simulate failover until the orphaned nodes can be seen using 'echo 
> dump | nc zookeeperHost zookeeperPort'
> 
> I turned on debugging when reproducing the issue and below is the sample log 
> from the IO error and the associated uncaught thread exception.
> 
> 2021-07-29 22:23:50.367-0500 WARN APP= COMP= (localhost:2182) ClientCnxn 
> Session 0x200086ade23 for sever localhost/0:0:0:0:0:0:0:1:2183, Closing 
> socket connection. Attempting reconnect except it is a 
> SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read 
> additional data from server sessionid 0x200086ade23, likely server has 
> closed socket
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
> 2021-07-29 22:23:50.475-0500 DEBUG APP= COMP= LeaderSelector-1 RetryLoopImpl 
> Retry-able exception received
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /MyApp/MyLeaderKey/_c_15e65f7d-4f53-4213-a915-16d3aa318c90-lock-
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 
> at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1837)
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193)
>  [110:curator-framework:5.2.0]
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) 
> [109:curator-client:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
>  [110:curator-framework:5.2.0]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573)
>  [110:curator-framework:5.2.0]
> at 
> 

Re: [External Sender] Double Locking Issue

2021-07-20 Thread Jordan Zimmerman
lete its lock node 
> N1) and attempts to acquire lock by creating lock node N3 and watches for 
> previous sequence node (N2)
> Client B who was blocked on acquire() -> wait() would be notified with 
> previous sequence node (N1) deletion -> getChildren, sorting them, and seeing 
> if the lock's node is the first node in the sequence. So, Client B sees its 
> lock node N2 still (which I call it as about to be deleted node by server) 
> and thereby acquires the lock
>  
> [ AFTER FEW SECONDS ] :
> Server managed to delete the ephemeral node N2 as part of previous client 
> session cleanup
> Client A who was blocked on acquire() -> wait() would be notified with 
> previous sequence node (N2) deletion -> getChildren, sorting them, and seeing 
> if the lock's node is the first node in the sequence and thereby acquires the 
> lock
> Client B – its local lock thread data went stale (as its lock path N2 has 
> now been deleted by Server)
>  
> >  SUSPENDED in Curator means only that the connection has been lost, not that 
> > the session has ended.
> LOST is the state that means the session has ended
> Be aware of how GCs can affect Curator. See the Tech Note here: 
> https://cwiki.apache.org/confluence/display/CURATOR/TN10
> Also read this Tech Note on session handling: 
> https://cwiki.apache.org/confluence/display/CURATOR/TN14
>  
> Very good information on the tech notes. Thanks for that. Agreed, it’s always 
> recommended to clear the locks when it sees SUSPENDED event. And we are 
> already following your recommendation. Our application would clear the lock 
> when it sees SUSPENDED event.
>  
> To summarize my thoughts,
> Ephemeral node created by previous session was  still seen by client that 
> reconnected with new session id until server cleans that up the previous 
> session ephemeral node. This could happen if client manage to reconnect with 
> server with new session id before server cleans up the previous session. How 
> it affects the curator lock recipe ? Explanation : The above explained race 
> condition would make acquire() to hold the lock ( as its own lock node still 
> seen ), eventually leading to inconsistent state (i.e. curator local lock 
> state stale) when that lock node is being cleaned up by the server as part of 
> previous session cleanup activities.
> I am NOT seeing a Curator bug here, but looking out for suggestions / 
> recommendations in handling this zookeeper race condition. Either can a 
> feature be added to Curator to cover this case / any recommendations that 
> clients should follow.
>  
> Many Thanks,
> Viswa.
>  
> From: Sean Gibbons  <mailto:sean.gibb...@workday.com>>
> Date: Tuesday, 20 July 2021 at 12:19
> To: Viswanathan Rajagopal  <mailto:viswanathan.rajag...@workday.com>>
> Subject: FW: [External Sender] Re: Double Locking Issue
> 
> 
> 
> On 20/07/2021, 10:12, "Jordan Zimmerman"  <mailto:jor...@jordanzimmerman.com>> wrote:
> 
> A few more things...
> 
> > Based on our initial analysis and few test runs, we saw that Curator 
> acquire() method acquires the lock based on “about to be deleted lock node of 
> previous session”. Explanation : Ephemeral node created by previous session 
> was  still seen by client that reconnected with new session id until server 
> cleans that up. If this happens, Curator acquire() would hold the lock.
> 
> This isn't correct. Curator watches the previous node but lock 
> acquisition is always based on calling ZK's getChildren, sorting them, and 
> seeing if the lock's node is the first node in the sequence. If Ephemeral 
> nodes aren't cleaned up it wouldn't be a problem. Any phantom ephemeral nodes 
> would sort first and prevent Curator from believing it holds the lock.
> 
> >  *   On the above mentioned race condition, if client manage to 
> reconnect to server with new session id before server cleans up the ephemeral 
> nodes of client’s previous session,  Curator lock acquire() who is trying to 
> acquire the lock will hold the lock as it still sees the lock node in 
> zookeeper directory. Eventually server would be cleaning up the ephemeral 
> nodes leaving the Curator local lock thread data stale giving the illusion 
> that it still hold the lock while its ephemeral node is gone
> 
> I don't

Re: Double Locking Issue

2021-07-20 Thread Jordan Zimmerman
A few more things...

> Based on our initial analysis and few test runs, we saw that Curator 
> acquire() method acquires the lock based on “about to be deleted lock node of 
> previous session”. Explanation : Ephemeral node created by previous session 
> was  still seen by client that reconnected with new session id until server 
> cleans that up. If this happens, Curator acquire() would hold the lock.

This isn't correct. Curator watches the previous node but lock acquisition is 
always based on calling ZK's getChildren, sorting them, and seeing if the 
lock's node is the first node in the sequence. If Ephemeral nodes aren't 
cleaned up it wouldn't be a problem. Any phantom ephemeral nodes would sort 
first and prevent Curator from believing it holds the lock.
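
The acquisition check described above can be sketched in plain Java as below. 
This is only an illustration under stated assumptions, not Curator's 
implementation: the class LockOrderSketch, its methods and the node names are 
hypothetical, and the sketch assumes each child name ends in '-' followed by 
ZooKeeper's zero-padded sequence number, so the suffixes compare correctly as 
strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "getChildren, sort, check first"; not Curator code.
final class LockOrderSketch {
    // Returns the text after the last '-', assumed to be the sequence number.
    static String sequenceOf(String nodeName) {
        return nodeName.substring(nodeName.lastIndexOf('-') + 1);
    }

    // True only if ourNode has the lowest sequence number among the children.
    static boolean holdsLock(List<String> children, String ourNode) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparing(LockOrderSketch::sequenceOf));
        return !sorted.isEmpty() && sorted.get(0).equals(ourNode);
    }

    public static void main(String[] args) {
        // A not-yet-deleted node from an old session (sequence 2) sorts ahead
        // of the newer node (sequence 3), so the newer node does not hold the lock.
        List<String> children = Arrays.asList(
                "_c_aaa-lock-0000000003",
                "_c_bbb-lock-0000000002");
        System.out.println(holdsLock(children, "_c_aaa-lock-0000000003"));
        System.out.println(holdsLock(children, "_c_bbb-lock-0000000002"));
    }
}
```

Under these assumptions a lingering ephemeral node with a lower sequence number 
only delays whoever is queued behind it, which is the point made above.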

>  *   On the above mentioned race condition, if client manage to reconnect to 
> server with new session id before server cleans up the ephemeral nodes of 
> client’s previous session,  Curator lock acquire() who is trying to acquire 
> the lock will hold the lock as it still sees the lock node in zookeeper 
> directory. Eventually server would be cleaning up the ephemeral nodes leaving 
> the Curator local lock thread data stale giving the illusion that it still 
> hold the lock while its ephemeral node is gone

I don't see how this is the case. If the client has received a network 
partition it shouldn't consider any locks as being currently held (see the 
Tech Notes I mentioned in my previous email). If there is a partition during a 
call to acquire(), acquire() would throw an exception (once the retry policy 
has expired). BTW, what retry policy are you using?

So, to reiterate, I don't see how phantom/undeleted ephemeral nodes would cause 
a problem. The only problem it could cause is that a given Curator client takes 
longer to acquire a lock as it waits for those ephemerals to finally get 
deleted.

-JZ

> On Jul 19, 2021, at 6:45 PM, Viswanathan Rajagopal 
>  wrote:
> 
> Hi Team,
> 
> Good day.
> 
> Recently came across “Double Locking Issue (i.e. two clients acquiring lock)” 
> using Curator code ( InterProcessMutex lock APIs ) in our application
> 
> Our use case:
> 
>  *   Two clients attempts to acquire the zookeeper lock using Curator 
> InterProcessMutex and whoever owns it would release it once sees the 
> connection disconnect ( on receiving Connection.SUSPENDED / Connection.LOST 
> Curator Connection Events from Connection listener)
> 
> Issue we noticed:
> 
>  *   After session expired & reconnected with new session, both client seems 
> to have acquired the lock. Interesting thing that we found is that one of the 
> clients still holds the lock while its lock node (ephemeral) was gone
> 
> Things we found:
> 
>  *   Based on our initial analysis and few test runs, we saw that Curator 
> acquire() method acquires the lock based on “about to be deleted lock node of 
> previous session”. Explanation : Ephemeral node created by previous session 
> was  still seen by client that reconnected with new session id until server 
> cleans that up. If this happens, Curator acquire() would hold the lock.
> 
> 
> 
>  *   Clearly we could see the race condition (in zookeeper code) between 1). 
> Client reconnecting to server with new session id and 2). server deleting the 
> ephemeral nodes of client’s previous session. We were able to reproduce this 
> issue using the following approach,
> *   Artificially break the socket connection between client and server 
> for 30s
> *   Artificially pausing the set of server codes for a min and unpause
> 
> 
>  *   On the above mentioned race condition, if client manage to reconnect to 
> server with new session id before server cleans up the ephemeral nodes of 
> client’s previous session,  Curator lock acquire() who is trying to acquire 
> the lock will hold the lock as it still sees the lock node in zookeeper 
> directory. Eventually server would be cleaning up the ephemeral nodes leaving 
> the Curator local lock thread data stale giving the illusion that it still 
> hold the lock while its ephemeral node is gone
> 
> 
>  *   Timeline events described below for better understanding,
> *   At t1, Client A and Client B establishes zookeeper session with 
> session id A1 and B1 respectively
> *   At t2, Client A creates the lock node N1 & acquires the lock
> *   At t3, Client B creates the lock node N2 & blocked on acquire() to 
> acquire the lock
> *   At t4, session timed out for both clients & server is about to clean 
> up the old session
>     *   Client A trying to release the lock
> *   At t5, Client A and Client B reconnects to server with new session id 
> A2 and B2 respectively before server deletes the ephemeral node N1 & N2 of 
> previous client session. Client A releases the lock, deleting N1 and trying 
> to acquire it again by creating N3 node and Client B who is blocked on 
> acquire() acquires the lock based on N2 (about to be deleted node created by 
> 

Re: CuratorCache and sub path's

2021-06-29 Thread Jordan Zimmerman
Those sound like real bugs to me. I'd have to see examples, though, to know for 
sure.

-Jordan

> On Jun 29, 2021, at 12:32 PM, Ryan Ruel  wrote:
> 
> Thanks for the prompt reply, Jordan!
> 
> So the expectation then is that “this should work” (or be made to work)?
> 
> If so, submitting upstream PR’s is something I’d certainly be willing to do!
> 
> /Ryan
> 
>> On Jun 29, 2021, at 7:30 AM, Jordan Zimmerman  
>> wrote:
>> 
>> Hi,
>> 
>> Unfortunately, CuratorCache has not been tested in real-world scenarios 
>> very much. So, these may actually be bugs. If it's not behaving as you think 
>> it should please open an issue and hopefully a PR. If you're unable to write 
>> the PR maybe I or one of the other committers can get to it.
>> 
>> -Jordan
>> 
>>> On Jun 29, 2021, at 12:27 PM, Ryan Ruel  wrote:
>>> 
>>> I'm building an application using ModeledFrameworks and CuratorCache.
>>> 
>>> As I expect to have a large number of ZNodes (300k-1m potentially) I have 
>>> my data fanned out into a ZPath structure:
>>> 
>>> /foo/bar/thing1
>>> /foo/baz/thing2
>>> /foo/buzz/thing3
>>> 
>>> etc.  (in the real application, these paths are a few depths deeper, but I 
>>> think the point is the same).
>>> 
>>> The CuratorCache instance is watching/caching all data under parent ZPath 
>>> /foo, and "thing1-3" are of my ModeledFramework type.
>>> 
>>> This is mostly working fine, but I've run into a few issues with Curator 
>>> that have me questioning whether CuratorCache is supposed to be used with 
>>> sub-paths.
>>> 
>>> For example, if I try to use the "childrenAsZNodes()" method (defined in 
>>> the CachedModeledFramework interface) it fails as the implementation is 
>>> using a filter for the ZPath which doesn't expect "parent()" to be anything 
>>> but /foo (personally I'd have expected it to filter on Zpath startsWith() 
>>> /foo).
>>> 
>>> Additionally, I see that when I delete any of the ZNodes for the 
>>> intermediary path objects (such as /foo/bar) I receive an exception in my 
>>> application as curator is trying to deserialize /foo/bar with null bytes 
>>> (as that ZNode doesn't actually contain one of my ModeledFramework objects, 
>>> it's an empty placeholder).  
>>> 
>>> Note that my application doesn't create the sub-path ZNodes, but is rather 
>>> relying on the "Create Parents If Needed" create mode option.
>>> 
>>> I suppose an option would be for my application to create the intermediary 
>>> ZNodes in the ZPath with something that Jackson can actually deserialize, 
>>> but that is additional burden on ZooKeeper I'd like to avoid at scale.
>>> 
>>> Can anyone comment on if this is an appropriate use of CuratorCache?
>>> 
>>> /Ryan
>>> 
>>> 
>> 
> 



Re: CuratorCache and sub path's

2021-06-29 Thread Jordan Zimmerman
Hi,

Unfortunately, CuratorCache has not been tested in real-world scenarios very 
much. So, these may actually be bugs. If it's not behaving as you think it 
should please open an issue and hopefully a PR. If you're unable to write the 
PR maybe I or one of the other committers can get to it.

-Jordan

> On Jun 29, 2021, at 12:27 PM, Ryan Ruel  wrote:
> 
> I'm building an application using ModeledFrameworks and CuratorCache.
> 
> As I expect to have a large number of ZNodes (300k-1m potentially) I have my 
> data fanned out into a ZPath structure:
> 
> /foo/bar/thing1
> /foo/baz/thing2
> /foo/buzz/thing3
> 
> etc.  (in the real application, these paths are a few depths deeper, but I 
> think the point is the same).
> 
> The CuratorCache instance is watching/caching all data under parent ZPath 
> /foo, and "thing1-3" are of my ModeledFramework type.
> 
> This is mostly working fine, but I've run into a few issues with Curator that 
> have me questioning whether CuratorCache is supposed to be used with 
> sub-paths.
> 
> For example, if I try to use the "childrenAsZNodes()" method (defined in the 
> CachedModeledFramework interface) it fails as the implementation is using a 
> filter for the ZPath which doesn't expect "parent()" to be anything but /foo 
> (personally I'd have expected it to filter on Zpath startsWith() /foo).
> 
> Additionally, I see that when I delete any of the ZNodes for the intermediary 
> path objects (such as /foo/bar) I receive an exception in my application as 
> curator is trying to deserialize /foo/bar with null bytes (as that ZNode 
> doesn't actually contain one of my ModeledFramework objects, it's an empty 
> placeholder).  
> 
> Note that my application doesn't create the sub-path ZNodes, but is rather 
> relying on the "Create Parents If Needed" create mode option.
> 
> I suppose an option would be for my application to create the intermediary 
> ZNodes in the ZPath with something that Jackson can actually deserialize, but 
> that is additional burden on ZooKeeper I'd like to avoid at scale.
> 
> Can anyone comment on if this is an appropriate use of CuratorCache?
> 
> /Ryan
> 
> 



[ANNOUNCE] Enrico Olivelli is the new PMC Chair

2020-11-18 Thread Jordan Zimmerman
Hello Everyone,

Congratulations to Enrico Olivelli who has been nominated and has accepted
the position of PMC Chair for Apache Curator taking over from me. I'm
certain that Curator is in excellent hands with Enrico and we all truly
appreciate his willingness to take on the Chair.

I have been the Chair since May of 2013 - that's a long time :D I'm
thrilled to pass the "gavel" to Enrico. I'm not going anywhere and will
still contribute as I can to the project.

Sincerely,

Jordan Zimmerman


Re: Curator PersistentNode recipe

2020-10-04 Thread Jordan Zimmerman
Yes - the node is deleted ala:

client.delete().guaranteed().forPath(localNodePath);

So, if the connection is currently lost the path is added to an internal queue 
and deleted when the connection is repaired.

-Jordan

> On Oct 3, 2020, at 12:17 PM, evaristo.camar...@yahoo.es wrote:
> 
> Hi there,
> 
> I am using the PersistentNode recipe from Curator 4.3.0 to maintain an 
> ephemeral node on a ZK cluster.
> 
> In some circumstances I need to close the PersistentNode instance with its 
> close() method while the connection to the ZK cluster may be LOST. In that 
> scenario, close() throws an IOException wrapping a ConnectionLossException, 
> and the removeWatches() call is not performed...
> 
> From reading the code, my understanding is that once the connection to the 
> cluster is re-established, the watches are removed and the node is deleted 
> (even though the recipe was already closed)...
> 
> I would appreciate it if someone could confirm that my assumption is correct 
> and that this is the expected behaviour of the recipe.
> 
> 
> Thanks in advance,
> 
> /Evaristo
> 



Re: REOPEN CURATOR-538?

2020-07-23 Thread Jordan Zimmerman
I could've sworn we addressed this as part of 5.0 - I'll do some research or 
maybe someone else can comment.

-JZ

> On Jul 23, 2020, at 11:53 AM, evaristo.camar...@yahoo.es wrote:
> 
> Hi there,
> 
> I propose to re-open Curator - 538
> 
> [CURATOR-538] Background exception was not retry-able or retry gave up - ASF 
> JIRA 
> 
> I am experiencing the same issue (NPE) on a K8S deployment with Curator 4.3.0.
> Jordan, if I understand your comment correctly, you are proposing to check 
> whether server.addr.getAddress() is null...
> Currently, a single server in the loop throwing the NPE is enough to prevent 
> the connectString from being modified. I can provide a patch to avoid the NPE.
> I see 2 options:
> 1.- Keep the behavior compatible: if a single server.addr.getAddress() 
> resolves to null, the method returns an empty string and therefore the 
> connectString is left unmodified (I can add a LOG warn statement).
> 2.- Alternatively, continue the loop to the end, ignoring servers whose 
> address resolves to null...
> Which do you think is best?
> Regards,
> /Chevaris
> 
> 
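
The two options proposed in the quoted message can be compared side by side in a small sketch. It is only an illustration: ConnectStringSketch and its methods are hypothetical names, not Curator code, and the sketch assumes only the JDK fact that InetSocketAddress.getAddress() returns null for an unresolved address.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Curator code: builds "ip:port,ip:port" from a
// server list in which some hostnames may not have been resolved.
class ConnectStringSketch {

    // Option 1: return "" as soon as one address is unresolved, so the
    // caller leaves its current connect string untouched.
    static String allOrNothing(List<InetSocketAddress> servers) {
        List<String> parts = new ArrayList<>();
        for (InetSocketAddress server : servers) {
            InetAddress address = server.getAddress(); // null when unresolved
            if (address == null) {
                return "";
            }
            parts.add(address.getHostAddress() + ":" + server.getPort());
        }
        return String.join(",", parts);
    }

    // Option 2: skip unresolved servers and keep going to the end of the list.
    static String skipUnresolved(List<InetSocketAddress> servers) {
        List<String> parts = new ArrayList<>();
        for (InetSocketAddress server : servers) {
            InetAddress address = server.getAddress();
            if (address != null) {
                parts.add(address.getHostAddress() + ":" + server.getPort());
            }
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // An IP literal resolves without a DNS lookup; the second entry is
        // deliberately left unresolved to mimic the failing case.
        InetSocketAddress resolved = new InetSocketAddress("10.0.0.1", 2181);
        InetSocketAddress unresolved = InetSocketAddress.createUnresolved("zk-1.invalid", 2181);
        List<InetSocketAddress> servers = Arrays.asList(resolved, unresolved);

        System.out.println("option 1: '" + allOrNothing(servers) + "'");
        System.out.println("option 2: '" + skipUnresolved(servers) + "'");
    }
}
```

Either way the null check removes the NPE. The trade-off is that option 1 never shrinks the ensemble because of a transient DNS failure, while option 2 keeps tracking configuration changes but may temporarily drop a server from the connect string.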



Re: Curator - Zookeeper compatibility

2020-06-11 Thread Jordan Zimmerman
4.3.0 will work. We should update the website.

> On Jun 11, 2020, at 10:12 AM, Borja Bravo Alférez  wrote:
> 
> Dear Curator community, 
> 
> I have a question about the Curator - Zookeeper compatibility with Zookeeper 
> 3.4.x. Could I use Curator 4.3.0 with it?
> 
> On the one hand, there is this doc that clearly says that the last Curator 
> version with support is 4.2.x: 
> https://curator.apache.org/zk-compatibility-34.html 
> 
> 
> On the other hand, there is this compatibility matrix specifying that the 
> whole 4.0 branch is compatible. It seems that the 4.3.0 version was released 
> to fix some bugs and the reason to maintain this branch is to keep 
> compatibility with the old ZooKeeper. Maybe the previous link just needs an 
> update.
> 
> https://cwiki.apache.org/confluence/display/CURATOR/Releases 
> 
> 
> Regards,
> 
> Borja
> 
> 



Re: CURATOR-569 MISSING IN CURATOR 5.0.0 RELEASE NOTES

2020-06-01 Thread Jordan Zimmerman
Fixed

> On Jun 1, 2020, at 10:25 AM, evaristo.camar...@yahoo.es wrote:
> 
> Hi there,
> 
> I saw that the CURATOR-569 issue was added to the 5.0.0 release, but it is 
> missing from the release notes.
> 
> Regards,
> 
> /Evaristo



Re: Curator 5.0 binary compatibility

2020-05-27 Thread Jordan Zimmerman
BookKeeper is not using the two recipes then, NodeCache and PathChildrenCache. 
I think we can safely release and document the workaround for those that need 
it.

-Jordan

> On May 27, 2020, at 9:51 AM, Enrico Olivelli  wrote:
> 
> Jordan and Cameron
> I have double checked most of the projects I am aware of and finally came to
> the conclusion that upgrading to 5.0 is safe in all of my cases.
> 
> I only hit a nuisance in Apache BookKeeper because we build with
> -Werror, so using deprecated APIs is treated as an error, but it is not a
> binary compatibility issue.
> 
> I have also run the tests of the staged sources locally.
> I am casting my +1
> 
> I think that this discussion has been useful for the community
> 
> Thank you
> Enrico
> 
> On Wed, May 27, 2020 at 00:55, Jordan Zimmerman
> <jor...@jordanzimmerman.com> wrote:
> 
>> If that comes to pass I'll write a new Tech Note on the wiki on how to
>> create a Compatibility JAR.
>> 
>> -JZ
>> 
>>> On May 26, 2020, at 5:41 PM, Cameron McKenzie 
>> wrote:
>>> 
>>> I will release on Sunday if we haven't got any feedback indicating we
>>> should do otherwise.
>>> 
>>> On Wed, May 27, 2020 at 6:32 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>> 
>>>> 
>>>> 
>>>> On Tue, May 26, 2020 at 21:53, Jordan Zimmerman
>>>> <jor...@jordanzimmerman.com> wrote:
>>>> 
>>>>> The initial email is almost a week old. I posted on Curator's Twitter
>>>>> account too. No response from anyone other than me Cameron and Enrico.
>> Very
>>>>> discouraging. Let's give it to the end of the week. If no one responds
>> I
>>>>> suggest we release and tell people how to work around it if they need
>> to.
>>>>> 
>>>> 
>>>> Agreed
>>>> 
>>>> Enrico
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> -JZ
>>>>> 
>>>>> On May 25, 2020, at 9:10 AM, Jordan Zimmerman <
>> jor...@jordanzimmerman.com <mailto:jor...@jordanzimmerman.com>>
>>>>> wrote:
>>>>> 
>>>>> If most of the problem is about ListenerContainer, don't we have a way
>> to
>>>>> keep it and emulate it using and implementation based on the new API ?
>>>>> 
>>>>> 
>>>>> Unfortunately not. The issue is that it leaks a Guava class (Function)
>> in
>>>>> its API. See here:
>>>>> 
>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java#L87
>>>>> 
>>>>> -JZ
>>>>> 
>>>>> On May 25, 2020, at 1:33 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>> 
>>>>> On Sun, May 24, 2020 at 23:17, Jordan Zimmerman
>>>>> <jor...@jordanzimmerman.com> wrote:
>>>>> 
>>>>> Enrico,
>>>>> 
>>>>> It reminds me of the breaking changes in Guava and other widely used
>>>>> libraries.
>>>>> 
>>>>> 
>>>>> 
>>>>> In fact Guava is terrible for people (like in my company) that deal
>> with
>>>>> lots of third party dependencies.
>>>>> 
>>>>> 
>>>>> The problem for us is that we can never change our APIs if this is the
>>>>> case. Note that ListenerContainer has been marked deprecated since
>> 4.1.1 (
>>>>> 
>>>>> 
>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java
>>>>> <
>>>>> 
>>>>> 
>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/cur

Re: Curator 5.0 binary compatibility

2020-05-26 Thread Jordan Zimmerman
If that comes to pass I'll write a new Tech Note on the wiki on how to create a 
Compatibility JAR.

-JZ

> On May 26, 2020, at 5:41 PM, Cameron McKenzie  wrote:
> 
> I will release on Sunday if we haven't got any feedback indicating we
> should do otherwise.
> 
> On Wed, May 27, 2020 at 6:32 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> 
>> 
>> 
>> On Tue, May 26, 2020 at 21:53, Jordan Zimmerman
>> <jor...@jordanzimmerman.com> wrote:
>> 
>>> The initial email is almost a week old. I posted on Curator's Twitter
>>> account too. No response from anyone other than me Cameron and Enrico. Very
>>> discouraging. Let's give it to the end of the week. If no one responds I
>>> suggest we release and tell people how to work around it if they need to.
>>> 
>> 
>> Agreed
>> 
>> Enrico
>> 
>> 
>> 
>>> 
>>> -JZ
>>> 
>>> On May 25, 2020, at 9:10 AM, Jordan Zimmerman 
>>> wrote:
>>> 
>>> If most of the problem is about ListenerContainer, don't we have a way to
>>> keep it and emulate it using and implementation based on the new API ?
>>> 
>>> 
>>> Unfortunately not. The issue is that it leaks a Guava class (Function) in
>>> its API. See here:
>>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java#L87
>>> 
>>> -JZ
>>> 
>>> On May 25, 2020, at 1:33 AM, Enrico Olivelli  wrote:
>>> 
>>> On Sun, May 24, 2020 at 23:17, Jordan Zimmerman
>>> <jor...@jordanzimmerman.com> wrote:
>>> 
>>> Enrico,
>>> 
>>> It reminds me of the breaking changes in Guava and other widely used
>>> libraries.
>>> 
>>> 
>>> 
>>> In fact Guava is terrible for people (like in my company) that deal with
>>> lots of third party dependencies.
>>> 
>>> 
>>> The problem for us is that we can never change our APIs if this is the
>>> case. Note that ListenerContainer has been marked deprecated since 4.1.1 (
>>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java
>>> ).
>>> 
>>> 
>>> 
>>> I didn't check the code yet, I am sorry, so maybe I saying something that
>>> is not doable.
>>> If most of the problem is about ListenerContainer, don't we have a way to
>>> keep it and emulate it using and implementation based on the new API ?
>>> 
>>> as Jordan said, any other comment from the community will be very
>>> appreciated, maybe we are talking about smoke.
>>> 
>>> Enrico
>>> 
>>> 
>>> 
>>> So, we're really left with these options:
>>> 
>>> Release Curator 5.0 and let the issues fall onto those with compatibility
>>> problems
>>> Bundle or refer to a compatibility JAR that is put early in the CLASSPATH
>>> as I outlined in my test project
>>> Move Curator 5.0 to a new package so that it can exist in the same JVM as
>>> earlier versions of Curator.
>>> Backout the change and mark the APIs as deprecated and push the problem to
>>> a future version
>>> 
>>> -Jordan
>>> 
>>> On May 24, 2020, at 3:58 PM, Enrico Olivelli 
>>> 
>>> wrote:
>>> 
>>> 
>>> 
>>> 
>>> On Sun, May 24, 2020, 22:48, Cameron McKenzie <cammcken...@apache.org> wrote:
>>> 
>>> Enrico,
>>> Can you explain your environment that exposes these backwards
>>> compatibility issues?
>>> 
>>> Cameron,
>>> Let's say we have two libraries Foo and Bar that are compiled for
>>> 
>>> Curator 4.x.
>>> 
>>> 
>>> I am now using in my Application Baz that use both Foo and Bar. So I
>>> 
>>> have Curator 4.x on the classpath.
>>> 
>>> Developers of Foo want to move to Curator 5.x in Foo 2.0, but Bar is
>>> 
>>> still happy with Curator 4.x.
>>> 

Re: Curator 5.0 binary compatibility

2020-05-26 Thread Jordan Zimmerman
The initial email is almost a week old. I posted on Curator's Twitter account 
too. No response from anyone other than me, Cameron and Enrico. Very 
discouraging. Let's give it until the end of the week. If no one responds I 
suggest we release and tell people how to work around it if they need to.

-JZ

> On May 25, 2020, at 9:10 AM, Jordan Zimmerman  
> wrote:
> 
>> If most of the problem is about ListenerContainer, don't we have a way to
>> keep it and emulate it using and implementation based on the new API ?
> 
> 
> Unfortunately not. The issue is that it leaks a Guava class (Function) in its 
> API. See here: 
> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java#L87
> 
> -JZ
> 
>> On May 25, 2020, at 1:33 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
>> 
>> On Sun, May 24, 2020 at 23:17, Jordan Zimmerman
>> <jor...@jordanzimmerman.com> wrote:
>> 
>>> Enrico,
>>> 
>>> It reminds me of the breaking changes in Guava and other widely used
>>> libraries.
>> 
>> 
>> In fact Guava is terrible for people (like in my company) that deal with
>> lots of third party dependencies.
>> 
>> 
>>> The problem for us is that we can never change our APIs if this is the
>>> case. Note that ListenerContainer has been marked deprecated since 4.1.1 (
>>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java
>>> ).
>>> 
>> 
>> I didn't check the code yet, I am sorry, so maybe I saying something that
>> is not doable.
>> If most of the problem is about ListenerContainer, don't we have a way to
>> keep it and emulate it using and implementation based on the new API ?
>> 
>> as Jordan said, any other comment from the community will be very
>> appreciated, maybe we are talking about smoke.
>> 
>> Enrico
>> 
>> 
>>> 
>>> So, we're really left with these options:
>>> 
>>> Release Curator 5.0 and let the issues fall onto those with compatibility
>>> problems
>>> Bundle or refer to a compatibility JAR that is put early in the CLASSPATH
>>> as I outlined in my test project
>>> Move Curator 5.0 to a new package so that it can exist in the same JVM as
>>> earlier versions of Curator.
>>> Backout the change and mark the APIs as deprecated and push the problem to
>>> a future version
>>> 
>>> -Jordan
>>> 
>>>> On May 24, 2020, at 3:58 PM, Enrico Olivelli <eolive...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Sun, May 24, 2020, 22:48, Cameron McKenzie <cammcken...@apache.org> wrote:
>>>> Enrico,
>>>> Can you explain your environment that exposes these backwards
>>>> compatibility issues?
>>>> 
>>>> Cameron,
>>>> Let's say we have two libraries Foo and Bar that are compiled for
>>> Curator 4.x.
>>>> 
>>>> I am now using in my Application Baz that use both Foo and Bar. So I
>>> have Curator 4.x on the classpath.
>>>> Developers of Foo want to move to Curator 5.x in Foo 2.0, but Bar is
>>> still happy with Curator 4.x.
>>>> 
>>>> If I want to upgrade Foo to 2.0 I have these chances:
>>>> 1) Curator 5 is compatible with 4.x,so I can simply keep 5 and
>>> everything works
>>>> 2) Curator 5 is not compatible with 4.x so I can't have both (this is
>>> current case)
>>>> 3) Curator 5 is independent from 4.x and I can keep both of t

Re: Curator 5.0 binary compatibility

2020-05-25 Thread Jordan Zimmerman
> If most of the problem is about ListenerContainer, don't we have a way to
> keep it and emulate it using an implementation based on the new API?


Unfortunately not. The issue is that it leaks a Guava class (Function) in its 
API. See here: 
https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java#L87

-JZ

> On May 25, 2020, at 1:33 AM, Enrico Olivelli  wrote:
> 
> On Sun, May 24, 2020 at 23:17, Jordan Zimmerman
> <jor...@jordanzimmerman.com> wrote:
> 
>> Enrico,
>> 
>> It reminds me of the breaking changes in Guava and other widely used
>> libraries.
> 
> 
> In fact Guava is terrible for people (like in my company) that deal with
> lots of third party dependencies.
> 
> 
>> The problem for us is that we can never change our APIs if this is the
>> case. Note that ListenerContainer has been marked deprecated since 4.1.1 (
>> https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java
>> ).
>> 
> 
> I haven't checked the code yet, I am sorry, so maybe I am saying something
> that is not doable.
> If most of the problem is about ListenerContainer, don't we have a way to
> keep it and emulate it using an implementation based on the new API?
> 
> as Jordan said, any other comment from the community will be very
> appreciated, maybe we are talking about smoke.
> 
> Enrico
> 
> 
>> 
>> So, we're really left with these options:
>> 
>> - Release Curator 5.0 and let the issues fall onto those with compatibility
>>   problems
>> - Bundle or refer to a compatibility JAR that is put early in the CLASSPATH
>>   as I outlined in my test project
>> - Move Curator 5.0 to a new package so that it can exist in the same JVM as
>>   earlier versions of Curator
>> - Back out the change and mark the APIs as deprecated and push the problem to
>>   a future version
>> 
>> -Jordan
>> 
>>> On May 24, 2020, at 3:58 PM, Enrico Olivelli 
>> wrote:
>>> 
>>> 
>>> 
>>> On Sun, May 24, 2020, 22:48, Cameron McKenzie <cammcken...@apache.org> wrote:
>>> Enrico,
>>> Can you explain your environment that exposes these backwards
>>> compatibility issues?
>>> 
>>> Cameron,
>>> Let's say we have two libraries, Foo and Bar, that are compiled against
>>> Curator 4.x.
>>> 
>>> My application, Baz, uses both Foo and Bar, so I have Curator 4.x on the
>>> classpath.
>>> The developers of Foo want to move to Curator 5.x in Foo 2.0, but Bar is
>>> still happy with Curator 4.x.
>>> 
>>> If I want to upgrade Foo to 2.0 I have these options:
>>> 1) Curator 5 is compatible with 4.x, so I can simply use 5 and
>>> everything works
>>> 2) Curator 5 is not compatible with 4.x, so I can't have both (this is
>>> the current case)
>>> 3) Curator 5 is independent from 4.x and I can keep both of them
>>> 
>>> The best option for users is 1).
>>> 
>>> 3) is good anyway, but it needs more work for users that want to migrate.
>>> 
>>> Option 2) is not good. Users will have to shade/relocate Curator 5 or 4
>>> and Foo 2.0 or Bar.
>>> 
>>> Hope that this explains the problem better
>>> Enrico
>>> 
>>> 
>>> I am probably coming from a place of ignorance, but I
>>> haven't seen new versions of a third party binary being dropped into an
>>> existing environment without recompiling the application, so I have never
>>> encountered these binary compatibility issues before. My expectation with
>>> this release was that if you wanted to pickup the changes in Curator 5.0
>>> that you would rebuild your application against the new binar

Re: Curator 5.0 binary compatibility

2020-05-24 Thread Jordan Zimmerman
Enrico,

It reminds me of the breaking changes in Guava and other widely used libraries. 
The problem for us is that we can never change our APIs if this is the case. 
Note that ListenerContainer has been marked deprecated since 4.1.1 
(https://github.com/apache/curator/blob/apache-curator-4.1.1/curator-framework/src/main/java/org/apache/curator/framework/listen/ListenerContainer.java).

So, we're really left with these options:

- Release Curator 5.0 and let the issues fall onto those with compatibility 
  problems
- Bundle or refer to a compatibility JAR that is put early in the CLASSPATH as 
  I outlined in my test project
- Move Curator 5.0 to a new package so that it can exist in the same JVM as 
  earlier versions of Curator
- Back out the change and mark the APIs as deprecated and push the problem to 
  a future version

-Jordan

> On May 24, 2020, at 3:58 PM, Enrico Olivelli  wrote:
> 
> 
> 
> On Sun, May 24, 2020, 22:48, Cameron McKenzie <cammcken...@apache.org> wrote:
> Enrico,
> Can you explain your environment that exposes these backwards
> compatibility issues?
> 
> Cameron,
> Let's say we have two libraries, Foo and Bar, that are compiled against 
> Curator 4.x.
> 
> My application, Baz, uses both Foo and Bar, so I have Curator 4.x on the 
> classpath.
> The developers of Foo want to move to Curator 5.x in Foo 2.0, but Bar is 
> still happy with Curator 4.x.
> 
> If I want to upgrade Foo to 2.0 I have these options:
> 1) Curator 5 is compatible with 4.x, so I can simply use 5 and everything 
> works
> 2) Curator 5 is not compatible with 4.x, so I can't have both (this is the 
> current case)
> 3) Curator 5 is independent from 4.x and I can keep both of them
> 
> The best option for users is 1).
> 
> 3) is good anyway, but it needs more work for users that want to migrate.
> 
> Option 2) is not good. Users will have to shade/relocate Curator 5 or 4 and 
> Foo 2.0 or Bar.
> 
> Hope that this explains the problem better
> Enrico 
> 
> 
> I am probably coming from a place of ignorance, but I
> haven't seen new versions of a third party binary being dropped into an
> existing environment without recompiling the application, so I have never
> encountered these binary compatibility issues before. My expectation with
> this release was that if you wanted to pick up the changes in Curator 5.0
> you would rebuild your application against the new binaries and then
> redeploy the application. Obviously this compilation will break if you are
> using any of the changed APIs, but these are pretty trivial changes to fix.
> We could potentially deprecate the existing APIs and add the new ones, but
> this will produce more tech debt to clean up later.
> cheers
> 
> On Sat, May 23, 2020 at 7:40 PM Enrico Olivelli <eolive...@gmail.com> wrote:
> 
> > I will check your trick as soon as possible. I am sorry, this is a very
> > busy week for me and I do not have enough cycles. But I think that we should
> > address this problem in order to ease the adoption of the new code and APIs.
> >
> > Did you consider possibly rolling back the breaking changes?
> >
> > Another alternative, if we want to let users use both the old and the new
> > APIs is to simply rename all of the packages and start a brand new system.
> > This approach was done in Apache Commons and IIRC it will be done with
> > Netty5. We also did it with the new Apache Bookkeeper API.
> >
> > Pros:
> > No need to preserve compatibility, we are free to clean up all of the tech
> > debt.
> > The switch to Curator 5 will be an explicit opt-in
> >
> > Cons:
> > Cherry picks won't be straightforward.
> >
> > Enrico
> >
> > On Fri, May 22, 2020, 23:40, Jordan Zimmerman <jor...@jordanzimmerman.com>
> > wrote:
> >
> >> Hi Everyone,
> >>
> >> I've coded a possible solution in the test project. See here:
> >>
> >> https://github.com/Randgalt/curator_5_0_test/blob/master/combo/pom.xml#L49 
> >>
> >> It uses the Maven dependency plugin to create a small compatibility JAR
> >> that contains the Curator 4.3.0 versions of the classes that have changed
> >> in 5.0.0 (i.e. the ones that no longer return ListenerContainer). If this
> >> JAR is included in a CLASSPATH before Curator 5.0.0's JARs, these old
> >> classes will take precedence and thus old binaries will continue to work.

Re: Curator 5.0 binary compatibility

2020-05-24 Thread Jordan Zimmerman
I agree with Cameron here. If you absolutely have to have binary compatibility 
you can use the hack I created at https://github.com/Randgalt/curator_5_0_test 
- if need be, we can distribute the hack with 5.0. 

CURATOR COMMUNITY - PLEASE CHIME IN HERE - The decision we make here will 
affect Curator for a long time. Now's your chance to have input on that 
direction.

-Jordan

> On May 24, 2020, at 3:48 PM, Cameron McKenzie  wrote:
> 
> Enrico,
> Can you explain your environment that exposes these backwards
> compatibility issues? I am probably coming from a place of ignorance, but I
> haven't seen new versions of a third party binary being dropped into an
> existing environment without recompiling the application, so I have never
> encountered these binary compatibility issues before. My expectation with
> this release was that if you wanted to pick up the changes in Curator 5.0
> you would rebuild your application against the new binaries and then
> redeploy the application. Obviously this compilation will break if you are
> using any of the changed APIs, but these are pretty trivial changes to fix.
> We could potentially deprecate the existing APIs and add the new ones, but
> this will produce more tech debt to clean up later.
> cheers
> 
> On Sat, May 23, 2020 at 7:40 PM Enrico Olivelli  wrote:
> 
>> I will check your trick as soon as possible. I am sorry, this is a very
>> busy week for me and I do not have enough cycles. But I think that we should
>> address this problem in order to ease the adoption of the new code and APIs.
>> 
>> Did you consider possibly rolling back the breaking changes?
>> 
>> Another alternative, if we want to let users use both the old and the new
>> APIs is to simply rename all of the packages and start a brand new system.
>> This approach was done in Apache Commons and IIRC it will be done with
>> Netty5. We also did it with the new Apache Bookkeeper API.
>> 
>> Pros:
>> No need to preserve compatibility, we are free to clean up all of the tech
>> debt.
>> The switch to Curator 5 will be an explicit opt-in
>> 
>> Cons:
>> Cherry picks won't be straightforward.
>> 
>> Enrico
>> 
>> On Fri, May 22, 2020, 23:40, Jordan Zimmerman wrote:
>> 
>>> Hi Everyone,
>>> 
>>> I've coded a possible solution in the test project. See here:
>>> 
>>> https://github.com/Randgalt/curator_5_0_test/blob/master/combo/pom.xml#L49
>>> 
>>> It uses the Maven dependency plugin to create a small compatibility JAR
>>> that contains the Curator 4.3.0 versions of the classes that have changed
>>> in 5.0.0 (i.e. the ones that no longer return ListenerContainer). If this
>>> JAR is included in a CLASSPATH before Curator 5.0.0's JARs, these old
>>> classes will take precedence and thus old binaries will continue to work.
>>> The curator_5_0_test shows this. run.sh is the previous way with the
>>> error. run-compatibility.sh is with the compatibility JAR.
>>> 
>>> Thoughts? Notable, this doesn't change the master code of Curator at all.
>>> We could add it to the 5.0 release. I don't think there's an issue with
>>> this "hack". Can anyone think of one? I'd really appreciate people testing
>>> with it. Try a build with just Curator 5.0 and then install and include
>>> this curator-5_0-test:combo:1.0-SNAPSHOT early in the CLASSPATH - it should
>>> work.
>>> 
>>> -Jordan
>>> 
>>> On May 21, 2020, at 10:43 AM, Jordan Zimmerman <
>>> jor...@jordanzimmerman.com> wrote:
>>> 
>>> Hello All,
>>> 
>>> Sorry for the cross-posting but this is important enough to justify it.
>>> 
>>> Apache Curator is in the process of releasing version 5.0. We've taken
>>> the opportunity to address some long standing tech debt but this causes
>>> breaking changes. We've detailed the breaks here:
>>> http://curator.apache.org/staging/breaking-changes.html. The Clirr
>>> report shows the exact API changes:
>>> http://curator.apache.org/staging/curator-recipes/clirr-report.html. The
>>> first two of these are the most worrisome. NodeCache's and
>>> PathChildrenCache's getListenable() methods now have a different return
>>> type. This has far reaching implications. If a Curator user were to drop in
>>> Curator 5.0 without any code changes they will get runtime exceptions when
>>> these methods are called.
>>> 
>>> I've written a test that shows the problem:
>>> 

Re: Curator 5.0 binary compatibility

2020-05-22 Thread Jordan Zimmerman
Hi Everyone,

I've coded a possible solution in the test project. See here:


https://github.com/Randgalt/curator_5_0_test/blob/master/combo/pom.xml#L49

It uses the Maven dependency plugin to create a small compatibility JAR that 
contains the Curator 4.3.0 versions of the classes that have changed in 5.0.0 
(i.e. the ones that no longer return ListenerContainer). If this JAR is 
included in a CLASSPATH before Curator 5.0.0's JARs, these old classes will 
take precedence and thus old binaries will continue to work. The 
curator_5_0_test shows this. run.sh is the previous way with the error. 
run-compatibility.sh is with the compatibility JAR.

Thoughts? Notably, this doesn't change the master code of Curator at all. We 
could add it to the 5.0 release. I don't think there's an issue with this 
"hack". Can anyone think of one? I'd really appreciate people testing with it. 
Try a build with just Curator 5.0 and then install and include this 
curator-5_0-test:combo:1.0-SNAPSHOT early in the CLASSPATH - it should work.

-Jordan
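
The compatibility JAR described above relies on search order: when the same name exists in two classpath entries, the earlier entry wins. The sketch below illustrates that ordering with the JDK's URLClassLoader. It is an assumption-laden demo, not part of the test project: it uses a plain text resource instead of class files to stay self-contained, and the file name and the "compat-4.3.0" / "curator-5.0.0" contents are labels invented for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ClasspathOrderDemo {

    // Returns the content of the first entry called `name` when the two
    // directories are searched in the given order.
    static String firstMatch(Path first, Path second, String name) {
        try {
            URL[] searchPath = { first.toUri().toURL(), second.toUri().toURL() };
            try (URLClassLoader loader = new URLClassLoader(searchPath, null)) {
                URL hit = loader.findResource(name);
                byte[] bytes = Files.readAllBytes(Paths.get(hit.toURI()));
                return new String(bytes, StandardCharsets.UTF_8);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Creates a temporary directory holding one file with the given content.
    static Path dirWith(String name, String content) {
        try {
            Path dir = Files.createTempDirectory("cp-demo");
            Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both directories contain an entry with the same name.
        Path compat = dirWith("marker.txt", "compat-4.3.0");
        Path current = dirWith("marker.txt", "curator-5.0.0");

        // Whichever directory is listed first supplies the entry.
        System.out.println(firstMatch(compat, current, "marker.txt"));
        System.out.println(firstMatch(current, compat, "marker.txt"));
    }
}
```

This is also why the approach is fragile: it only works as long as the compatibility JAR stays ahead of the Curator 5.0 JARs on the CLASSPATH.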

> On May 21, 2020, at 10:43 AM, Jordan Zimmerman  
> wrote:
> 
> Hello All,
> 
> Sorry for the cross-posting but this is important enough to justify it.
> 
> Apache Curator is in the process of releasing version 5.0. We've taken the 
> opportunity to address some long standing tech debt but this causes breaking 
> changes. We've detailed the breaks here: 
> http://curator.apache.org/staging/breaking-changes.html. The Clirr report 
> shows the exact API changes: 
> http://curator.apache.org/staging/curator-recipes/clirr-report.html. The 
> first two of these are the most worrisome. NodeCache's and 
> PathChildrenCache's getListenable() methods now have a different return type. 
> This has far reaching implications. If a Curator user were to drop in Curator 
> 5.0 without any code changes they will get runtime exceptions when these 
> methods are called. 
> 
> I've written a test that shows the problem:
> 
> git clone https://github.com/Randgalt/curator_5_0_test.git 
> cd curator_5_0_test
> ./run.sh
> 
> You will see:
> 
> java.lang.NoSuchMethodError: org.apache.curator.framework.recipes.cache.PathChildrenCache.getListenable()Lorg/apache/curator/framework/listen/ListenerContainer;
>   at binary.Curator50Test.run(Curator50Test.java:26)
>   at test.Test.main(Test.java:9)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
>   at java.lang.Thread.run(Thread.java:748)
> 
> Enrico Olivelli brought this to our attention. Curator 5.0 is a major version 
> bump so breaking changes are implied. But, maybe this is blocker? What do 
> people think? If this is a serious enough concern we can come up with a 
> workaround. 
> 
> Please discuss and let's hold off completing the current release until this 
> has been fully discussed.
> 
> -Jordan



Curator 5.0 binary compatibility

2020-05-21 Thread Jordan Zimmerman
Hello All,

Sorry for the cross-posting but this is important enough to justify it.

Apache Curator is in the process of releasing version 5.0. We've taken the 
opportunity to address some long-standing tech debt but this causes breaking 
changes. We've detailed the breaks here: 
http://curator.apache.org/staging/breaking-changes.html. The Clirr report 
shows the exact API changes: 
http://curator.apache.org/staging/curator-recipes/clirr-report.html. The 
first two of these are the most worrisome. NodeCache's and PathChildrenCache's 
getListenable() methods now have a different return type. This has far-reaching 
implications. If a Curator user were to drop in Curator 5.0 without recompiling 
their code, they would get runtime exceptions when these methods are called. 

I've written a test that shows the problem:

git clone https://github.com/Randgalt/curator_5_0_test.git
cd curator_5_0_test
./run.sh

You will see:

java.lang.NoSuchMethodError: org.apache.curator.framework.recipes.cache.PathChildrenCache.getListenable()Lorg/apache/curator/framework/listen/ListenerContainer;
  at binary.Curator50Test.run(Curator50Test.java:26)
  at test.Test.main(Test.java:9)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
  at java.lang.Thread.run(Thread.java:748)
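
The error names the method together with its return type because the JVM links 
call sites by name plus full descriptor, so code compiled against the old return 
type no longer finds the method. A minimal sketch of that rule using only the JDK 
follows. ReturnTypeLinkDemo and its nested Cache class are hypothetical stand-ins, 
not Curator classes, and MethodHandles lookup is used here only to imitate 
descriptor-based resolution.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class ReturnTypeLinkDemo {
    // Hypothetical stand-in for a recipe class whose accessor return type changed.
    static final class Cache {
        public Collection<String> getListenable() {
            return new ArrayList<>();
        }
    }

    // True only if Cache declares getListenable() with exactly this return type.
    // The lookup matches on name plus descriptor, as a compiled call site does.
    static boolean resolves(Class<?> returnType) {
        try {
            MethodHandles.lookup().findVirtual(
                Cache.class, "getListenable", MethodType.methodType(returnType));
            return true;
        } catch (NoSuchMethodException | IllegalAccessException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Descriptor a caller compiled against the current signature would carry.
        System.out.println(MethodType.methodType(Collection.class).toMethodDescriptorString());
        System.out.println("Collection return type resolves: " + resolves(Collection.class));
        // A caller compiled against a different return type fails to link.
        System.out.println("List return type resolves: " + resolves(List.class));
    }
}
```

Recompiling the calling code against 5.0 regenerates the descriptor, which is why 
the break only shows up for binaries that are dropped in unchanged.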

Enrico Olivelli brought this to our attention. Curator 5.0 is a major version 
bump so breaking changes are implied. But, maybe this is a blocker? What do 
people think? If this is a serious enough concern we can come up with a 
workaround. 

Please discuss and let's hold off completing the current release until this has 
been fully discussed.

-Jordan

Re: UTILITY TO HANDLE ZNODE PATHS WITH PROTECTED MODE

2020-05-09 Thread Jordan Zimmerman
It's OK with me. Open an issue and create a PR.

-JZ

> On May 9, 2020, at 2:43 AM, evaristo.camar...@yahoo.es wrote:
> 
> Hi there,
> We are using Curator + Zookeeper in a manner that our ZNode paths contain 
> meaningful information.
> 
> The model is like:
> 
> /root/report/
> /root/report/R1
> /root/report/R2
> /root/report/R3
>  
> In several use cases we are using the PersistentNode recipe (ephemeral mode) to 
> produce the R1, R2 ... reports. Another process watches all reports (via 
> PathChildrenCache) and parses the ZNode paths to determine the 
> report numbering...
> We are evaluating using protected mode for the reports; protected mode 
> prefixes the ZNode names ("_c_"). Following the previous example:
>  
> /root/report/
> /root/report/_c_012-3238-323232-434354-R1
> /root/report/_c_021-4239-523233-634355-R2
> /root/report/_c_031-5230-623234-834356-R3
>  
> It is not a big deal to remove the prefix to parse the meaningful information 
> (R1, R2...), but it would be nice to be able to do that in a backwards 
> compatible way that is resilient to Curator updates.
> An easy and generic way to achieve this goal is to add a public API that 
> determines whether a ZNode name uses Curator's protected mode, and 
> another method that removes the protected mode prefix from a given ZNode name. 
> In my view the CreateBuilderMain interface could be extended with default methods 
> that do that. Notice that currently CreateBuilderImpl is the one having 
> the logic to generate the protected mode path prefix, and that logic probably 
> should also be implemented (or at least specified) in the interface in order to 
> allow consistent behavior across different implementations.
> What do you think? Do you think it is worthwhile to specify the protected mode 
> prefix format and provide methods to handle it?
> Thanks in advance,
> /Evaristo
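
For illustration, the two helpers requested above could look roughly like this. 
It is a sketch under a stated assumption, namely that a protected name is "_c_" 
followed by a 36-character UUID and a "-" before the original node name (the IDs 
in the example paths above are abbreviated). ProtectedNames, isProtected and 
stripProtectedPrefix are hypothetical names, not an existing Curator API.

```java
import java.util.UUID;

final class ProtectedNames {
    private static final String PREFIX = "_c_";
    private static final int UUID_LENGTH = 36;
    // Index of the '-' that separates the UUID from the original node name (assumed format).
    private static final int SEPARATOR_INDEX = PREFIX.length() + UUID_LENGTH;

    // True if nodeName looks like "_c_<uuid>-<name>".
    static boolean isProtected(String nodeName) {
        if (nodeName.length() <= SEPARATOR_INDEX
                || !nodeName.startsWith(PREFIX)
                || nodeName.charAt(SEPARATOR_INDEX) != '-') {
            return false;
        }
        try {
            UUID.fromString(nodeName.substring(PREFIX.length(), SEPARATOR_INDEX));
            return true;
        } catch (IllegalArgumentException e) {
            return false; // the middle part is not a UUID
        }
    }

    // Returns the name without the protected prefix, or the input unchanged.
    static String stripProtectedPrefix(String nodeName) {
        return isProtected(nodeName) ? nodeName.substring(SEPARATOR_INDEX + 1) : nodeName;
    }

    public static void main(String[] args) {
        String protectedName = PREFIX + UUID.randomUUID() + "-R1";
        System.out.println(protectedName + " -> " + stripProtectedPrefix(protectedName));
        System.out.println("R2 -> " + stripProtectedPrefix("R2"));
    }
}
```

Because unprotected names pass through unchanged, the same parsing code works 
whether or not protected mode is enabled for a given node.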



Re: CURATOR WITH MULTIPLE ZK SERVERS

2020-05-03 Thread Jordan Zimmerman
We don't need a PR for a custom ensemble provider. I think the change in the 
builder is still needed. If you're going to just ignore the EnsembleTracker 
it's wasteful to run it. It should be disabled.

-JZ

> On May 3, 2020, at 1:44 PM, evaristo.camar...@yahoo.es wrote:
> 
> 
> It works fine in my use case and connectString changes are ignored, which is 
> what I am looking for.
>  
> I have not seen any problem in my app.
>  
> There are other CuratorFramework methods like config() and reconfig(), and I am 
> not sure if this ImmutableEnsembleProvider could have any side effect there 
> and potentially affect any Curator recipe. I understand from your previous 
> answer that you do not think that is the case.
>  
> A possible solution for CURATOR-558 is to include an 
> ImmutableEnsembleProvider. That avoids adding extra methods to the 
> CuratorFrameworkFactory.Builder. What do you think is best? Do you still 
> believe that it is worthwhile to extend the builder?
>  
> Thanks,
>  
> /Evaristo
> 
> 
> On Sunday, May 3, 2020, 20:09:04 CEST, Jordan Zimmerman wrote:
> 
> 
> Try it and see. Why would there be side effects?
> 
> -Jordan
> 
>> On May 3, 2020, at 12:42 PM, evaristo.camar...@yahoo.es 
>> <mailto:evaristo.camar...@yahoo.es> wrote:
>> 
>> Hi again,
>> I am looking for a possible workaround in my application until there is a new 
>> Curator release with CURATOR-558.
>> I am just curious about the side effects of using a custom EnsembleProvider 
>> that basically ignores new connect strings passed via setConnectionString... I 
>> guess the connect string will not be changed, but I am not sure of possible side 
>> effects on Curator.
>>  
>> import java.io.IOException;
>> import java.util.Objects;
>> import org.apache.curator.ensemble.EnsembleProvider;
>>  
>> public class CustomEnsembleProvider implements EnsembleProvider
>> {
>>     private final String connectionString;
>>  
>>     /**
>>      * The connection string to use
>>      *
>>      * @param connectionString connection string
>>      */
>>     public CustomEnsembleProvider(String connectionString)
>>     {
>>         this.connectionString = Objects.requireNonNull(connectionString);
>>     }
>>  
>>     @Override
>>     public void start() throws Exception
>>     {
>>         // Do nothing: the connection string is fixed at construction
>>     }
>>  
>>     @Override
>>     public void close() throws IOException
>>     {
>>         // Do nothing: there are no resources to release
>>     }
>>  
>>     @Override
>>     public void setConnectionString(String connectionString)
>>     {
>>         // Do nothing: updates to the connection string are ignored on purpose
>>     }
>>  
>>     @Override
>>     public String getConnectionString()
>>     {
>>         return connectionString;
>>     }
>>  
>>     @Override
>>     public boolean updateServerListEnabled()
>>     {
>>         return false;
>>     }
>> }
>> Thanks in advance,
>>  /Evaristo
>>  
>> 
>> On Friday, May 1, 2020, 19:33:15 CEST, evaristo.camar...@yahoo.es wrote:
>> 
>> 
>> Thx Jordan. I will open a JIRA ticket with a PR as discussed.
>> On Friday, May 1, 2020, 19:18:25 CEST, Jordan Zimmerman wrote:
>> 
>> 
>> For backward compatibility I think Option 1 is the only option. We can't 
>> know who's reliant on the current behavior.
>> 
>> -JZ
>> 
>>> On May 1, 2020, at 12:13 PM, evaristo.camar...@yahoo.es 
>>> <mailto:evaristo.camar...@yahoo.es> wrote:
>>> 
>>> Thx Jordan for the fast answer.
>>> 
>>> I see 2 options here:
>>> 
>>> 1.- Add a new option in CuratorFrameworkFactory.Builder -> for instance 
>>> .skipEnsembleTracking. If the option is present make ensembleTracker null 
>>> in CuratorFrameworkImpl
>>> 2.- The other option is to modify EnsembleTracker so that, if the 
>>> EnsembleProvider is a FixedEnsembleProvider with updateServerList set to 
>>> false, it ignores configuration events.
>>> 
>>> What do you think is better?
>>> - Personally, option 1 looks a bit strange to me: if the Builder allows using 
>>> a FixedEnsembleProvider with updateServerList set to false, but the server 
>>> list is still updated unless you also use the new option, that is a bit 
>>> misleading.
>>> - Option 2 has a clearer API in my view: if FixedEnsembleProvider is 
>>> used with updateServerList set to false, then EnsembleTracker can skip 
>>> updating the connect string.
>>> 
>>>

Re: CURATOR WITH MULTIPLE ZK SERVERS

2020-05-03 Thread Jordan Zimmerman
Try it and see. Why would there be side effects?

-Jordan

> On May 3, 2020, at 12:42 PM, evaristo.camar...@yahoo.es wrote:
> 
> Hi again,
> I am looking for a possible workaround in my application until there is a new 
> Curator release with CURATOR-558.
> I am just curious about the side effects of using a custom EnsembleProvider 
> that basically ignores new connect strings passed via setConnectionString... I 
> guess the connect string will not be changed, but I am not sure of possible side 
> effects on Curator.
>  
> import java.io.IOException;
> import java.util.Objects;
> import org.apache.curator.ensemble.EnsembleProvider;
>  
> public class CustomEnsembleProvider implements EnsembleProvider
> {
>     private final String connectionString;
>  
>     /**
>      * The connection string to use
>      *
>      * @param connectionString connection string
>      */
>     public CustomEnsembleProvider(String connectionString)
>     {
>         this.connectionString = Objects.requireNonNull(connectionString);
>     }
>  
>     @Override
>     public void start() throws Exception
>     {
>         // Do nothing: the connection string is fixed at construction
>     }
>  
>     @Override
>     public void close() throws IOException
>     {
>         // Do nothing: there are no resources to release
>     }
>  
>     @Override
>     public void setConnectionString(String connectionString)
>     {
>         // Do nothing: updates to the connection string are ignored on purpose
>     }
>  
>     @Override
>     public String getConnectionString()
>     {
>         return connectionString;
>     }
>  
>     @Override
>     public boolean updateServerListEnabled()
>     {
>         return false;
>     }
> }
> Thanks in advance,
>  /Evaristo
>  
> 
> On Friday, May 1, 2020, 19:33:15 CEST, evaristo.camar...@yahoo.es wrote:
> 
> 
> Thx Jordan. I will open a JIRA ticket with a PR as discussed.
> On Friday, May 1, 2020, 19:18:25 CEST, Jordan Zimmerman wrote:
> 
> 
> For backward compatibility I think Option 1 is the only option. We can't know 
> who's reliant on the current behavior.
> 
> -JZ
> 
>> On May 1, 2020, at 12:13 PM, evaristo.camar...@yahoo.es 
>> <mailto:evaristo.camar...@yahoo.es> wrote:
>> 
>> Thx Jordan for the fast answer.
>> 
>> I see 2 options here:
>> 
>> 1.- Add a new option in CuratorFrameworkFactory.Builder -> for instance 
>> .skipEnsembleTracking. If the option is present make ensembleTracker null in 
>> CuratorFrameworkImpl
>> 2.- The other option is to modify EnsembleTracker so that, if the 
>> EnsembleProvider is a FixedEnsembleProvider with updateServerList set to 
>> false, it ignores configuration events.
>> 
>> What do you think is better?
>> - Personally, option 1 looks a bit strange to me: if the Builder allows using 
>> a FixedEnsembleProvider with updateServerList set to false, but the server 
>> list is still updated unless you also use the new option, that is a bit 
>> misleading.
>> - Option 2 has a clearer API in my view: if FixedEnsembleProvider is used 
>> with updateServerList set to false, then EnsembleTracker can skip updating 
>> the connect string.
>> 
>> Regards,
>> 
>> /Evaristo
>>  
>> 
>> On Friday, May 1, 2020, 18:17:17 CEST, Jordan Zimmerman wrote:
>> 
>> 
>> I think this might be a bug or maybe an oversight. It looks like the 
>> EnsembleTracker gets enabled regardless of the mode that Curator is in: see 
>> CuratorFrameworkImpl's constructor where it allocates an EnsembleTracker. 
>> You could open a PR that makes this optional.
>> 
>> -Jordan
>> 
>>> On Apr 30, 2020, at 10:00 AM, evaristo.camar...@yahoo.es 
>>> <mailto:evaristo.camar...@yahoo.es> wrote:
>>> 
>>> Hi there,
>>>  
>>> We have an app that is connected to 2 different ZK clusters (one ZK in the 
>>> same data center, a remote ZK cluster in a remote data center). In the same 
>>> JVM, we instantiate 2 CuratorFramework instances (once per ZK cluster)
>>>  
>>> We were using ZK clusters with zk 3.4.10
>>> And Curator clients with Curator 4.3.0 + Zk 3.4.10
>>> Everything working fine.
>>>  
>>> Now we are  upgrading zk servers to 3.5.6 and making tests with this 
>>> combination:
>>>  
>>> ZK clusters are ZK 3.5.6
>>> And our Curator Clients are Curator 4.3.0 + ZK 3.4.10 (working in 
>>> compatibility mode)
>>>  
>>> There is a use case that is failing and probably is related with ZK dynamic 
>>> reconfiguration capabilities supported in ZK 3.5 and would like to 
>>> understand better how Curator handles this.
>>>  
>>> The issue a

Re: Dealing with eventual consistency

2020-05-02 Thread Jordan Zimmerman
That part is true, but I'm not sure how much use it is. If there are multiple 
writers you can't know what the latest version is - there may be writes from 
other servers that haven't been seen yet. But, then, I probably don't completely 
understand the use case.

-JZ

> On May 2, 2020, at 9:35 AM, Scott Blum  wrote:
> 
> I don't follow... I'm saying that if server A wants to be absolutely sure a 
> write is visible to server B, it can grab the mzxid of the write, send it to 
> server B, and server B can ensure that its LastZxid >= that value before 
> doing the read.



Re: Dealing with eventual consistency

2020-05-01 Thread Jordan Zimmerman
You can still miss a pending write on the client. Just because you read N from 
the ZNode doesn't mean that that's its real value on the leader. The only way 
in ZooKeeper to be certain is to do a write, as those always take place on the 
leader.

-Jordan

> On May 1, 2020, at 5:51 PM, Scott Blum  wrote:
> 
> On Fri, May 1, 2020 at 6:00 PM David Smiley wrote:
>> I don't think I'll use the "LastZxid" trick because we update parts of the ZK 
>> tree with high frequency but not this one yet the Zxid would still soar 
>> upwards.
> 
> I don't follow.. if server A writes a node, recording the mzxid associated 
> with that write, and passes it along to server B, then server B just needs to 
> be sure its LastZxid is >= the one that server A wrote.  Doesn't matter if 
> server B's LastZxid is the same, 10 ahead, or 1000 ahead.
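
The check described above can be sketched as follows. This is only an 
illustration with JDK types: ZxidBarrier and ensureSeen are hypothetical names, 
the LongSupplier stands in for whatever exposes the client's last-seen zxid, and 
the Runnable stands in for a blocking catch-up step such as a sync. It does not 
address the caveat raised in the reply about writes the leader has not yet 
propagated.

```java
import java.util.function.LongSupplier;

final class ZxidBarrier {
    // Returns true once the local client has seen at least requiredZxid.
    // lastZxid: reads the local connection's last-seen zxid (assumed accessor).
    // catchUp:  hypothetical blocking hook that brings the client up to date.
    static boolean ensureSeen(long requiredZxid, LongSupplier lastZxid, Runnable catchUp) {
        if (lastZxid.getAsLong() >= requiredZxid) {
            return true; // same, 10 ahead, or 1000 ahead: all are fine
        }
        catchUp.run();
        return lastZxid.getAsLong() >= requiredZxid;
    }

    public static void main(String[] args) {
        long[] local = {100L}; // simulated last-seen zxid on server B
        // Server A wrote at zxid 105; B is behind, so the catch-up hook runs once.
        boolean ready = ensureSeen(105L, () -> local[0], () -> local[0] = 110L);
        System.out.println("safe to read: " + ready);
    }
}
```

The catch-up hook only runs when the local zxid is behind, so the common case 
where plenty of time has passed since the write costs a single comparison.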



Re: Dealing with eventual consistency

2020-05-01 Thread Jordan Zimmerman
You can also know whether or not you have a recent version of a ZNode by using 
the version. Read the node and its Stat, save that (Curator's Cache has this) 
and then update the node using the version. All updates occur on the current 
ZooKeeper leader so you get an exception if the version doesn't match.
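
A minimal in-memory sketch of that version check, to show the shape of the 
pattern rather than real client code: VersionedNode, Snapshot and this setData 
are hypothetical stand-ins for a ZNode, its data plus Stat version, and a 
conditional update, and IllegalStateException plays the role of the bad-version 
error.

```java
import java.util.concurrent.atomic.AtomicReference;

final class VersionedNode {
    // Immutable pair of data and version, like a ZNode's data and its Stat version.
    static final class Snapshot {
        final String data;
        final int version;

        Snapshot(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> current;

    VersionedNode(String initialData) {
        current = new AtomicReference<>(new Snapshot(initialData, 0));
    }

    Snapshot read() {
        return current.get();
    }

    // Applies the write only if expectedVersion is still the current version.
    void setData(String data, int expectedVersion) {
        Snapshot seen = current.get();
        if (seen.version != expectedVersion
                || !current.compareAndSet(seen, new Snapshot(data, seen.version + 1))) {
            throw new IllegalStateException("version mismatch, expected " + expectedVersion);
        }
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("a");
        Snapshot cached = node.read();      // what a cache would hold
        node.setData("b", cached.version);  // succeeds, version becomes 1
        try {
            node.setData("c", cached.version); // stale version is rejected
        } catch (IllegalStateException e) {
            System.out.println("cached copy was stale: " + e.getMessage());
        }
    }
}
```

A rejected write tells the caller its cached copy is out of date, which is the 
signal to re-read before retrying.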

Other than that, ZK is an eventually consistent system as you know. 

-JZ

> On May 1, 2020, at 5:00 PM, David Smiley  wrote:
> 
> Thanks for pointing out that conversation RE sync() "Consistency Guarantees". 
>  It's a shame sync() has that deficiency of not actually getting a quorum, 
> thus there's still an edge case.
> 
> I don't think I'll use the "LastZxid" trick because we update parts of the ZK 
> tree with high frequency but not this one yet the Zxid would still soar 
> upwards.
> 
> The Cache stuff is really nifty but still leaves an eventual consistency 
> issue.  Machine A needs to get machine B to do something predicated on the 
> cached thing having a version >= some value that Machine A knows.
> 
> I think I must just deal with the fact of passing on a version / xid even 
> though I said for the system in question it's ugly.  I need to make it look 
> pretty :-)
> 
> ~ David
> 
> 
> On Fri, May 1, 2020 at 11:46 AM Jordan Zimmerman wrote:
>> I'm aware a znode or perhaps pzxid (for a path of data) could be passed to 
>> the second machine.
> 
> It's available to the client. See: 
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L810
> 
> You'd create your own ZooKeeper subclass - "cnxn" is a protected field. So, 
> something like:
> 
> public class MyZooKeeper extends ZooKeeper {
>   ...
> 
>   public long getLastZxid() {
>   return cnxn.getLastZxid();
>   }
> }
> 
>> But I'm worried about over-using sync() because I imagine it's not free.
> 
> TBH - I've never understood the sync() method. I always thought it was 
> useless but Alex Shraer wrote some ways that it can be useful. See: 
> https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201908.mbox/thread?2 
> (search for "Consistency Guarantees").
> 
>> Imagine this is some immutable configuration data that is set once then used 
>> a lot by many ZK clients thereafter many thousands of times for days on end.
> 
> Note: this is how Facebook uses ZooKeeper. It's the backbone of their config 
> system. They have 10s of thousands (something like that) of read-only 
> observers that sit in front of their main ZK ensemble.
> 
>>  Does Curator facilitate this in any way?
> 
> This is what the various Cache recipes are for (Scott's TreeCache or the 
> upcoming CuratorCache). These recipes take care of pulling down the latest 
> versions of ZNodes for you.
> 
> -Jordan
> 
>> On May 1, 2020, at 12:43 AM, David Smiley wrote:
>> 
>> Hello,
>> 
>> I'm trying to come to grips with the ramifications of ZooKeeper's eventual 
>> consistency model and what mechanisms exist to help.  
>> 
>> Imagine two machines that use ZK, one of which stores data in it then tells 
>> the other machine to do something that will require it to read what the 
>> first machine wrote.
>> 
>> ZK's docs warn about this:
>> https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperProgrammers.md#ch_zkGuarantees
>> .. and refer to a sync() method to help:
>> https://github.com/apache/zookeeper/blob/2e14a29cc6e58d9561e80b737a3168fbb1f752b4/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeper.java#L3057
>> 
>> But I'm worried about over-using sync() because I imagine it's not free.  
>> For the scenario I have in mind, the vast majority of the time, the second 
>> machine will see the latest state because lots of time passes between the 
>> write and the read.  Imagine this is some immutable configuration data that 
>> is set once then used a lot by many ZK clients thereafter many thousands of 
>> times for 

Re: CURATOR WITH MULTIPLE ZK SERVERS

2020-05-01 Thread Jordan Zimmerman
For backward compatibility I think Option 1 is the only option. We can't know 
who's reliant on the current behavior.

-JZ

> On May 1, 2020, at 12:13 PM, evaristo.camar...@yahoo.es wrote:
> 
> Thx Jordan for the fast answer.
> 
> I see 2 options here:
> 
> 1.- Add a new option in CuratorFrameworkFactory.Builder -> for instance 
> .skipEnsembleTracking. If the option is present make ensembleTracker null in 
> CuratorFrameworkImpl
> 2.- The other option is to modify EnsembleTracker so that, if the 
> EnsembleProvider is a FixedEnsembleProvider with updateServerList set to 
> false, it ignores configuration events.
> 
> What do you think is better?
> - Personally, option 1 looks a bit strange to me: if the Builder allows using a 
> FixedEnsembleProvider with updateServerList set to false, but the server list 
> is still updated unless you also use the new option, that is a bit misleading.
> - Option 2 has a clearer API in my view: if FixedEnsembleProvider is used 
> with updateServerList set to false, then EnsembleTracker can skip updating 
> the connect string.
> 
> Regards,
> 
> /Evaristo
>  
> 
> On Friday, May 1, 2020, 18:17:17 CEST, Jordan Zimmerman wrote:
> 
> 
> I think this might be a bug or maybe an oversight. It looks like the 
> EnsembleTracker gets enabled regardless of the mode that Curator is in: see 
> CuratorFrameworkImpl's constructor where it allocates an EnsembleTracker. You 
> could open a PR that makes this optional.
> 
> -Jordan
> 
>> On Apr 30, 2020, at 10:00 AM, evaristo.camar...@yahoo.es 
>> <mailto:evaristo.camar...@yahoo.es> wrote:
>> 
>> Hi there,
>>  
>> We have an app that is connected to 2 different ZK clusters (one ZK in the 
>> same data center, a remote ZK cluster in a remote data center). In the same 
>> JVM, we instantiate 2 CuratorFramework instances (once per ZK cluster)
>>  
>> We were using ZK clusters with zk 3.4.10
>> And Curator clients with Curator 4.3.0 + Zk 3.4.10
>> Everything working fine.
>>  
>> Now we are  upgrading zk servers to 3.5.6 and making tests with this 
>> combination:
>>  
>> ZK clusters are ZK 3.5.6
>> And our Curator Clients are Curator 4.3.0 + ZK 3.4.10 (working in 
>> compatibility mode)
>>  
>> There is a use case that is failing, probably related to the dynamic 
>> reconfiguration capabilities supported in ZK 3.5, and we would like to 
>> understand better how Curator handles this.
>>  
>> The issue appears when our application can connect to the local cluster, but 
>> connectivity with the remote cluster is broken.
>> At this point the CuratorFramework session with the remote cluster is 
>> expired (SUSPENDED, LOST...), and we can observe that Curator recalculates 
>> the connectString
>> : "Connection string changed to: "
>> , ignoring the initial connect string provided when the CuratorFramework 
>> instance was created. This new connectString is not valid for the remote 
>> cluster; the reason is that our routing is a bit complex, and we connect to 
>> the remote ZK cluster via a Virtual IP provided by a load balancer (in this 
>> case a Kubernetes ingress gateway).
>>  
>> The original connect string was 10.10.10.10:2181 (actually the Virtual IP 
>> used to connect to the remote cluster), but the new negotiated connect string 
>> is different because it reflects the internal topology of the remote cluster, 
>> with multiple servers.
>>  
>> I saw that the CuratorFrameworkFactory builder allows setting the 
>> EnsembleProvider when creating CF instances, and I am not 100% sure whether 
>> using a FixedEnsembleProvider with updateServerListEnabled set to false would 
>> be enough to fix the issue. So, I would appreciate any information about how 
>> CuratorFramework 4.3 handles connectString changes, and a possible way 
>> to prevent the connectString from being changed dynamically.
>>  
>> Thanks in advance,
>>  
>> /Evaristo
>> 
> 



Re: CURATOR WITH MULTIPLE ZK SERVERS

2020-05-01 Thread Jordan Zimmerman
I think this might be a bug or maybe an oversight. It looks like the 
EnsembleTracker gets enabled regardless of the mode that Curator is in: see 
CuratorFrameworkImpl's constructor where it allocates an EnsembleTracker. You 
could open a PR that makes this optional.

-Jordan

> On Apr 30, 2020, at 10:00 AM, evaristo.camar...@yahoo.es wrote:
> 
> Hi there,
>  
> We have an app that is connected to 2 different ZK clusters (one ZK in the 
> same data center, a remote ZK cluster in a remote data center). In the same 
> JVM, we instantiate 2 CuratorFramework instances (once per ZK cluster)
>  
> We were using ZK clusters with zk 3.4.10
> And Curator clients with Curator 4.3.0 + Zk 3.4.10
> Everything working fine.
>  
> Now we are  upgrading zk servers to 3.5.6 and making tests with this 
> combination:
>  
> ZK clusters are ZK 3.5.6
> And our Curator Clients are Curator 4.3.0 + ZK 3.4.10 (working in 
> compatibility mode)
>  
> There is a use case that is failing, probably related to the dynamic 
> reconfiguration capabilities supported in ZK 3.5, and we would like to 
> understand better how Curator handles this.
>  
> The issue appears when our application can connect to the local cluster, but 
> connectivity with the remote cluster is broken.
> At this point the CuratorFramework session with the remote cluster is 
> expired (SUSPENDED, LOST...), and we can observe that Curator recalculates 
> the connectString
> : "Connection string changed to: "
> , ignoring the initial connect string provided when the CuratorFramework 
> instance was created. This new connectString is not valid for the remote 
> cluster; the reason is that our routing is a bit complex, and we connect to 
> the remote ZK cluster via a Virtual IP provided by a load balancer (in this 
> case a Kubernetes ingress gateway).
>  
> The original connect string was 10.10.10.10:2181 (actually the Virtual IP 
> used to connect to the remote cluster), but the new negotiated connect string 
> is different because it reflects the internal topology of the remote cluster, 
> with multiple servers.
>  
> I saw that the CuratorFrameworkFactory builder allows setting the 
> EnsembleProvider when creating CF instances, and I am not 100% sure whether 
> using a FixedEnsembleProvider with updateServerListEnabled set to false would 
> be enough to fix the issue. So, I would appreciate any information about how 
> CuratorFramework 4.3 handles connectString changes, and a possible way 
> to prevent the connectString from being changed dynamically.
>  
> Thanks in advance,
>  
> /Evaristo
> 



Re: Dealing with eventual consistency

2020-05-01 Thread Jordan Zimmerman
> I'm aware a znode or perhaps pzxid (for a path of data) could be passed to 
> the second machine.

It's available to the client. See: 
https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L810
 

You'd create your own ZooKeeper subclass - "cnxn" is a protected field. So, 
something like:

public class MyZooKeeper extends ZooKeeper {
    ...

    public long getLastZxid() {
        return cnxn.getLastZxid();
    }
}

> But I'm worried about over-using sync() because I imagine it's not free.

TBH - I've never understood the sync() method. I always thought it was useless 
but Alex Shraer wrote some ways that it can be useful. See: 
https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201908.mbox/thread?2 
(search for "Consistency Guarantees").

> Imagine this is some immutable configuration data that is set once then used 
> a lot by many ZK clients thereafter many thousands of times for days on end.

Note: this is how Facebook uses ZooKeeper. It's the backbone of their config 
system. They have 10s of thousands (something like that) of read-only observers 
that sit in front of their main ZK ensemble.

>  Does Curator facilitate this in any way?

This is what the various Cache recipes are for (Scott's TreeCache or the 
upcoming CuratorCache). These recipes take care of pulling down the latest 
versions of ZNodes for you.

-Jordan

> On May 1, 2020, at 12:43 AM, David Smiley  wrote:
> 
> Hello,
> 
> I'm trying to come to grips with the ramifications of ZooKeeper's eventual 
> consistency model and what mechanisms exist to help.  
> 
> Imagine two machines that use ZK, one of which stores data in it then tells 
> the other machine to do something that will require it to read what the first 
> machine wrote.
> 
> ZK's docs warn about this:
> https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperProgrammers.md#ch_zkGuarantees
> .. and refer to a sync() method to help:
> https://github.com/apache/zookeeper/blob/2e14a29cc6e58d9561e80b737a3168fbb1f752b4/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeper.java#L3057
> 
> But I'm worried about over-using sync() because I imagine it's not free.  For 
> the scenario I have in mind, the vast majority of the time, the second 
> machine will see the latest state because lots of time passes between the 
> write and the read.  Imagine this is some immutable configuration data that 
> is set once then used a lot by many ZK clients thereafter many thousands of 
> times for days on end.
> 
> I'm aware a znode or perhaps pzxid (for a path of data) could be passed to 
> the second machine... but for the system in question, this would be really 
> ugly and I want to consider alternatives.  Besides, the data is organized 
> into a tree that could have arbitrary nesting, so it's not clear to me that 
> there's a single version for this any way.
> 
> Scott Blum told me about how there's an increasing "zxid" for all state 
> change in ZK.  I can see this on ZK's ClientCnxn.getLastZxid().  If I were to 
> pass that zxid to the additional machines (ZK clients) from the first for 
> basically all interactions (not too ugly for the system), how would the 
> receiving machine use this to get in sync?  I'm guessing it could read its 
> own connection zxid and if it's out of date then call sync()?  Does that make 
> sense?  Is there another strategy to be recommended?  Does Curator facilitate 
> this in any way?
> 
> Thanks in advance!  I already searched this list for answers.
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> 


Re: Handle retriable exception?

2020-04-24 Thread Jordan Zimmerman
Yeah - I think that part of the code (and several others) needs a re-think. We 
have the opportunity right now as we're going to do 5.0. I'm amenable to 
holding off and changing whatever we think can be improved.

-Jordan

> On Apr 24, 2020, at 10:12 PM, tison  wrote:
> 
> Hi curators,
> 
> In some scenarios I'd like to set ExponentialBackoffRetry as the retry 
> policy so that Curator automatically retries actions for me, while I can 
> also handle retriable exceptions on demand, for example to trigger some 
> cleanup/fencing actions on session expiration.
> 
> One possible way in my mind is to have different clients for different 
> logic. But given that the interface of RetryPolicy is
> 
> RetryPolicy#allowRetry(int retryCount, long elapsedTimeMs, RetrySleeper 
> sleeper);
> 
> we don't have an opportunity to handle different exception types.
> 
> An alternative is to always pass retriable exceptions to the background 
> callback as well, where the default action is to ignore them but the user 
> can customize it.
> Any thoughts?
> 
> Best,
> tison.
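
To make the idea above concrete, here is a small standalone sketch of a retry 
loop that also hands each retriable exception to a caller-supplied hook. It 
uses only JDK types and is not Curator's RetryPolicy API: ObservedRetry, call, 
isRetriable and onRetriable are hypothetical names, and the backoff is plain 
doubling rather than ExponentialBackoffRetry's exact formula.

```java
import java.util.function.BiConsumer;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class ObservedRetry {
    // Runs op, retrying up to maxRetries times when isRetriable accepts the failure.
    // onRetriable sees every retriable exception (with the retry count), including
    // the last one, so the caller can run cleanup or fencing logic.
    static <T> T call(Supplier<T> op, int maxRetries, long baseSleepMs,
                      Predicate<RuntimeException> isRetriable,
                      BiConsumer<Integer, RuntimeException> onRetriable) {
        for (int retryCount = 0; ; retryCount++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (!isRetriable.test(e)) {
                    throw e; // not retriable: fail immediately
                }
                onRetriable.accept(retryCount, e);
                if (retryCount >= maxRetries) {
                    throw e; // retries exhausted
                }
                sleep(baseSleepMs << Math.min(retryCount, 20)); // doubling backoff, capped
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result = call(
            () -> {
                if (attempts[0]++ < 2) {
                    throw new IllegalStateException("transient failure");
                }
                return "ok";
            },
            3, 10L,
            e -> e instanceof IllegalStateException,
            (count, e) -> System.out.println("retry " + count + " after: " + e.getMessage()));
        System.out.println(result);
    }
}
```

Keeping the hook separate from the retry decision means the default can be a 
no-op, which matches the "omit by default, customize on demand" behaviour 
suggested above.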



Re: Release of 5.0?

2020-04-23 Thread Jordan Zimmerman
OK - Shay or Cameron please decide between the two of you. Whoever does it can 
wait for 3.6.1 too if they want.

> On Apr 23, 2020, at 10:26 AM, shay shimony  wrote:
> 
> Hi, I have time in the coming week, so if you want to skip this one - I can 
> take it.
> 
> On Thu, Apr 23, 2020 at 7:45 AM Cameron McKenzie wrote:
> I'm happy for a release. I can do the build if no one else has time.
> cheers
> 
> On Thu, Apr 23, 2020 at 7:25 AM Jordan Zimmerman wrote:
> Hi Folks,
> 
> Are we ready for a 5.0 release? I think we have a good working set for it. 
> Speak now if you want anything else in this release. If no one speaks up, 
> I'll start on it soon (unless one of the other committers wants to run it - 
> let me know).
> 
> -Jordan



Release of 5.0?

2020-04-22 Thread Jordan Zimmerman
Hi Folks,

Are we ready for a 5.0 release? I think we have a good working set for it. 
Speak now if you want anything else in this release. If no one speaks up, I'll 
start on it soon (unless one of the other committers wants to run it - let me 
know).

-Jordan

Re: Help with apache curator 4.3.0

2020-04-06 Thread Jordan Zimmerman
Apache projects communicate via email. So, if you have Curator questions, etc. 
use this channel. ZooKeeper also has an email list.

-Jordan

> On Apr 6, 2020, at 12:52 PM, Suhas S R  wrote:
> 
> Thanks,  Mr. Jordan. I will explore further on this. Is there a forum where I 
> can ask questions/clarify my doubts? 
> 
> -sr
> 
> On Sun, 5 Apr 2020 at 21:20, Jordan Zimmerman wrote:
> Yes, dynamic reconfig is supported in Curator 4.x. There are no examples 
> because it's really a ZooKeeper feature, not a Curator feature. Curator's DSL 
> has a reconfig() method, etc. Curator also listens for dynamic ensemble 
> changes.
> 
> -Jordan
> 
> > On Apr 3, 2020, at 1:29 AM, Suhas S R wrote:
> > 
> > Hi Mr. Zimmerman, I am Suhas and I had a query on the apache-curator 
> > project.
> > 
> > We have a use case where we are planning to upgrade our existing system in 
> > zookeeper 3.4 to 3.6 so that we could use the dynamic-reconfig feature. We 
> > are currently using curator framework 2.12 and were evaluating options to 
> > upgrade to curator 4.3.
> > 
> > I wanted to know if dynamic reconfiguration functionality provided by 
> > zookeeper 3.5.x is supported through the curator 4.3.0? 
> > 
> > I could not find any examples/resources on the internet regarding this and 
> > going through the source code on GitHub there was no direct option to call 
> > the reconfig method.
> > 
> > Is it still work-in-progress or have I missed something?
> > 
> > Sorry to mail you here directly, if this is not the right forum can you 
> > please direct me to one.
> > 
> > Thanks,
> > Suhas SR
> 
> 
> 
> -- 
> Suhas SR



Re: Help with apache curator 4.3.0

2020-04-05 Thread Jordan Zimmerman
Yes, dynamic reconfig is supported in Curator 4.x. There are no examples 
because it's really a ZooKeeper feature, not a Curator feature. Curator's DSL 
has a reconfig() method, etc. Curator also listens for dynamic ensemble changes.

-Jordan

> On Apr 3, 2020, at 1:29 AM, Suhas S R  wrote:
> 
> Hi Mr. Zimmerman, I am Suhas and I had a query on the apache-curator project.
> 
> We have a use case where we are planning to upgrade our existing system in 
> zookeeper 3.4 to 3.6 so that we could use the dynamic-reconfig feature. We 
> are currently using curator framework 2.12 and were evaluating options to 
> upgrade to curator 4.3.
> 
> I wanted to know if dynamic reconfiguration functionality provided by 
> zookeeper 3.5.x is supported through the curator 4.3.0? 
> 
> I could not find any examples/resources on the internet regarding this and 
> going through the source code on GitHub there was no direct option to call 
> the reconfig method.
> 
> Is it still work-in-progress or have I missed something?
> 
> Sorry to mail you here directly, if this is not the right forum can you 
> please direct me to one.
> 
> Thanks,
> Suhas SR



Re: SSL Verifier issue with Curator 4.2 and ZK 3.5.6

2020-04-02 Thread Jordan Zimmerman
The reasoning is lost to history. I think Ioannis worked on that feature. In 
any event, a PR that makes configToConnectionString() non-static and 
overridable would be a nice addition, I think. Please open a Jira issue and a 
PR.

-Jordan
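
A minimal sketch of the kind of override discussed here, not Curator's actual implementation: it builds a connection string from the text form of a ZooKeeper dynamic config while keeping the configured hostnames/aliases instead of resolving them to IP addresses. The class name HostnameConnectString, the method fromDynamicConfig, the portOverride parameter and the example hostnames are all hypothetical; the 2181/2281 convention is the one from the quoted message below. IPv6 literals are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Curator or ZooKeeper.
final class HostnameConnectString {
    /**
     * Builds "host:port,host:port" from dynamic-config lines of the form
     * server.N=host:quorumPort:electionPort[:role];[clientHost:]clientPort
     * and keeps hostnames exactly as written. If portOverride > 0 it replaces
     * the advertised client port (e.g. 2281 for SSL by local convention).
     */
    static String fromDynamicConfig(String config, int portOverride) {
        List<String> endpoints = new ArrayList<>();
        for (String raw : config.split("\\R")) {
            String line = raw.trim();
            int eq = line.indexOf('=');
            int semi = line.indexOf(';');
            // Skip "version=..." lines and servers without a client port.
            if (!line.startsWith("server.") || eq < 0 || semi < eq) {
                continue;
            }
            String quorumPart = line.substring(eq + 1, semi);
            String clientPart = line.substring(semi + 1);
            int firstColon = quorumPart.indexOf(':');
            if (firstColon <= 0) {
                continue; // malformed entry
            }
            String host = quorumPart.substring(0, firstColon);

            int lastColon = clientPart.lastIndexOf(':');
            String clientHost = lastColon < 0 ? "" : clientPart.substring(0, lastColon);
            int port = Integer.parseInt(clientPart.substring(lastColon + 1).trim());
            // Prefer an explicit client address unless it is the wildcard.
            if (!clientHost.isEmpty() && !clientHost.equals("0.0.0.0")) {
                host = clientHost;
            }
            endpoints.add(host + ":" + (portOverride > 0 ? portOverride : port));
        }
        return String.join(",", endpoints);
    }
}
```

Calling fromDynamicConfig(config, 2281) with that convention yields a connect string whose hostnames can still be matched against the subject alternative names in the server certificates.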

> On Apr 1, 2020, at 6:15 PM, Joe Ammann  wrote:
> 
> Hi
> 
> we are using Curator/ZK in a setup where all client-server traffic is using 
> SSL.
> 
> Now, we are trying to switch to DynamicConfiguration on the server side. We 
> are aware that the Config-Events for the EnsembleTracker currently only 
> contain the cleartext port, not the SSL port (see 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3166). With a custom 
> EnsembleProvider that overrides the setConnectionString() method and with a 
> convention (we always use 2181 for cleartext and 2281 for SSL), we have 
> worked around this limitation.
> 
> But we are hitting another problem now: The configToConnectionString method 
> in the EnsembleTracker basically takes the hostnames/aliases from the config 
> events, and generates a connectionString with IP addresses. Unfortunately, in 
> our dynamic network environment, we mainly use DNS aliases, and the IP 
> addresses don't necessarily resolve to the aliases. And our SSL certificates 
> only contain the DNS aliases, not the IP addresses or the physical hostnames.
> 
> This leads now to a situation where after a config event is received, Curator 
> creates new ZK instances with a connect string that contains IP addresses. 
> And then ZK refuses to connect because it can't verify the server certificate 
> hostnames.
> 
> io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
> General SSLEngine problem
> .
> Caused by: java.security.cert.CertificateException: Failed to verify both 
> host address and host name
> at 
> org.apache.zookeeper.common.ZKTrustManager.performHostVerification(ZKTrustManager.java:145)
> at 
> org.apache.zookeeper.common.ZKTrustManager.checkServerTrusted(ZKTrustManager.java:104)
> at 
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1496)
> ... 30 common frames omitted
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [DNSAlias]
> at 
> org.apache.zookeeper.common.ZKHostnameVerifier.matchDNSName(ZKHostnameVerifier.java:224)
> at 
> org.apache.zookeeper.common.ZKHostnameVerifier.verify(ZKHostnameVerifier.java:170)
> at 
> org.apache.zookeeper.common.ZKTrustManager.performHostVerification(ZKTrustManager.java:141)
> 
> Of course, we could turn off SSL hostname verification, but we'd rather not 
> do that.
> 
> ssl.hostnameVerification=false
> ssl.quorum.hostnameVerification=false
> 
> What is the reason the EnsembleTracker translates the hostnames/aliases from 
> the config event to IP addresses? I understand that this does not cause issues 
> with plaintext communication, but it easily breaks any dynamic SSL 
> environment.
> 
> Do you have any recommendation how I can fix this?
> 
> CU, Joe
> 



Re: [CURATOR-549] Recipes based on Persistent Recursive Watchers needs more reviews

2020-03-31 Thread Jordan Zimmerman
Hi Folks

Any objection to merging this tomorrow? Do you need more time Shay?

-Jordan

P.S. I hope everyone is staying well in this trying time

> On Mar 27, 2020, at 4:57 PM, Jordan Zimmerman  
> wrote:
> 
> Hi Folks
> 
> Another plea for more reviews on this 
> https://github.com/apache/curator/pull/335 - this will become the new 
> caching recipe for Curator folks. Here's your opportunity to have input on it.
> 
> -Jordan



[CURATOR-549] Recipes based on Persistent Recursive Watchers needs more reviews

2020-03-27 Thread Jordan Zimmerman
Hi Folks

Another plea for more reviews on this 
https://github.com/apache/curator/pull/335 - this will become the new caching 
recipe for Curator folks. Here's your opportunity to have input on it.

-Jordan

New Persistent/Recursive watcher recipes - needs reviews

2020-03-23 Thread Jordan Zimmerman
Hi Folks,

Here are the new recipes based on Persistent/Recursive Watchers. There's a new 
PersistentWatcher recipe and a brand new implementation of a cache, 
CuratorCache, that replaces all the existing cache recipes. Reviews would be 
appreciated. 

https://github.com/apache/curator/pull/335 
 

-Jordan

Re: PROPOSAL: Curator 5.0

2020-03-15 Thread Jordan Zimmerman
Almost all Curator users should be unaffected. But, I should probably do some 
random checking on Github. The only ones affected are:

- Those who want/need to stay on ZK 3.4 - they shouldn't upgrade to Curator 5.0
- Clients using Curator's Reaper/ChildReaper classes. These clients should 
  change to container nodes.
- Any client code that uses Curator's ListenerContainer. This was an internal 
  class but maybe some people have used it.
- Any remaining Exhibitor users - Exhibitor has been dead for a long time. 
  Those that still need support can stay on Curator 4.x.

-Jordan

> On Mar 15, 2020, at 1:59 AM, Enrico Olivelli  wrote:
> 
> In your vision, will Curator 5 be compatible with 'simple' applications 
> written for Curator 4?
> Curator is like Zookeeper, you can end up with having several libraries that 
> rely on it.
> If we don't keep compatibility, then in order to update to 5 you need every other 
> library to move to 5 and they won't move to 5 because they have many users on 
> 4.
> 
> We should deal carefully with this problem 
> 
> Btw obviously a great +2 to moving to zk 3.6.0
> 
> Enrico 
> 
> On Sat, Mar 14, 2020, 23:34 shay shimony wrote:
> Sorry for the late reply.
> +1 from me too.
> 
> On Wed, Mar 4, 2020, 04:37 Jordan Zimmerman wrote:
> 
> > Hi Folks,
> >
> > I mentioned this a while back, but now that ZooKeeper 3.6.0 is out I'd like
> > to make it official. I propose we move to Curator 5.0 and apply a few
> > non-backward compatible changes - i.e. take the opportunity to fix some
> > tech debt. See CURATOR-558
> > <https://issues.apache.org/jira/browse/CURATOR-558> for details. Any
> > objections?
> >
> > -Jordan
> >



PROPOSAL: Curator 5.0

2020-03-03 Thread Jordan Zimmerman
Hi Folks,

I mentioned this a while back, but now that ZooKeeper 3.6.0 is out I'd like to 
make it official. I propose we move to Curator 5.0 and apply a few non-backward 
compatible changes - i.e. take the opportunity to fix some tech debt. See 
CURATOR-558 <https://issues.apache.org/jira/browse/CURATOR-558> for details. 
Any objections?

-Jordan

Re: Curator listener issue

2020-02-26 Thread Jordan Zimmerman
That sleep won't solve the problem as there is a single thread for Connection 
State Listeners. Instead, you can pass different Executors when you register 
your listener and your sleep would then have the desired effect.

-Jordan
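
A rough, self-contained simulation of the point above, assuming nothing beyond the JDK: when every listener runs on the one dispatch thread, a listener that sleeps holds up the ones registered after it, whereas giving each listener its own Executor lets them proceed independently. ListenerDispatchSketch and loggerRuns are hypothetical names and this is not Curator code; in Curator the corresponding registration would be the two-argument addListener(listener, executor) on getConnectionStateListenable().

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for a listener container; not Curator code.
final class ListenerDispatchSketch {
    private static final class Registration {
        final Consumer<String> listener;
        final Executor executor;

        Registration(Consumer<String> listener, Executor executor) {
            this.listener = listener;
            this.executor = executor;
        }
    }

    private final List<Registration> registrations = new CopyOnWriteArrayList<>();

    void addListener(Consumer<String> listener, Executor executor) {
        registrations.add(new Registration(listener, executor));
    }

    // Invoked from the single dispatch thread for every state change.
    void publish(String state) {
        for (Registration r : registrations) {
            r.executor.execute(() -> r.listener.accept(state));
        }
    }

    /**
     * Registers a sleeping "shutdown" listener first and a logging listener
     * second, then reports whether the logging listener ran within one second.
     */
    static boolean loggerRuns(boolean separateExecutors) {
        ExecutorService dispatchThread = Executors.newSingleThreadExecutor();
        ExecutorService shutdownPool = Executors.newSingleThreadExecutor();
        ExecutorService logPool = Executors.newSingleThreadExecutor();
        Executor direct = Runnable::run; // runs on the calling (dispatch) thread
        CountDownLatch logged = new CountDownLatch(1);
        ListenerDispatchSketch container = new ListenerDispatchSketch();

        container.addListener(state -> {
            try {
                Thread.sleep(2000); // the "sleep before shutting down"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, separateExecutors ? shutdownPool : direct);
        container.addListener(state -> logged.countDown(),
                separateExecutors ? logPool : direct);

        try {
            dispatchThread.execute(() -> container.publish("LOST"));
            return logged.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            dispatchThread.shutdownNow();
            shutdownPool.shutdownNow();
            logPool.shutdownNow();
        }
    }
}
```

In this sketch loggerRuns(true) reports that the logging listener ran while the other one was still sleeping, and loggerRuns(false) reports that it did not run in time because both listeners shared the dispatch thread.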

> On Feb 26, 2020, at 10:37 AM, Arpit Jain  wrote:
> 
> That's exactly what I thought too, so I put in a 10-second sleep before 
> shutting down and closing the Curator instance, but it did not change 
> the behaviour.
> 
> On Wed, Feb 26, 2020 at 3:30 PM Jordan Zimmerman wrote:
> You are shutting down your server in the handleConnectionStateChange() 
> handler. So, maybe your application shuts down/closes the Curator handle 
> before the other handler gets called. i.e. a kind of race.
> 
> -Jordan
> 
>> On Feb 26, 2020, at 10:15 AM, Arpit Jain wrote:
>> 
>> What is RetryLoopExecutor? Curator does all retries internally. You should 
>> not have your own retry mechanism.
>> It's just a retry loop for the submitted action. Will remove it if it's 
>> already done by Curator.
>> Calling blockUntilConnectedOrTimedOut() is unnecessary. Curator does this 
>> internally already.
>> Will remove it.
>> What do you do with the "isConnected" value? That seems suspicious to me.
>> It's not used anywhere. It's just for logging whether we get connected or not.
>> You do not get LOST until the session expires. How long is 
>> "coordinatorSessionTimeout"? You won't receive LOST until that has elapsed.
>>   "ConnectionTimeout": 1,
>>   "Hosts": "localhost:2181",
>>   "MaxRetries": 3,
>>   "RetryTimeout": 3000,
>>   "SessionTimeout": 9,
>> As I said earlier, I am receiving LOST on both application instances. It's 
>> only that specific listener that is not getting called. Here are the logs: 
>> 
>> [L: WARN] [O: c.t.s.c.ZookeeperHelper] [I: ] [U: ] [S: ] [P: platform2] [T: 
>> Curator-ConnectionStateManager-0] ZOOKEEPER STATE CHANGED TO : SUSPENDED
>> 2020-02-26 12:39:13.107+ [L: WARN] [O: o.a.c.f.s.ConnectionStateManager] 
>> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] 
>> Session timeout has elapsed while SUSPENDED. Injecting a session expiration. 
>> Elapsed ms: 40003. Adjusted session timeout ms: 4
>> 2020-02-26 12:39:13.109+ [L: WARN] [O: o.a.c.f.s.ConnectionStateManager] 
>> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] 
>> Session timeout has elapsed while SUSPENDED. Injecting a session expiration. 
>> Elapsed ms: 40002. Adjusted session timeout ms: 4
>> 2020-02-26 12:39:13.109+ [L: WARN] [O: o.a.c.f.s.ConnectionStateManager] 
>> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] 
>> Session timeout has elapsed while SUSPENDED. Injecting a session expiration. 
>> Elapsed ms: 40002. Adjusted session timeout ms: 4
>> 2020-02-26 12:39:13.111+ [L: WARN] [O: o.a.c.ConnectionState] [I: ] [U: 
>> ] [S: ] [P: platform2] [T: localhost-startStop-1-EventThread] Session 
>> expired event received
>> 2020-02-26 12:39:13.111+ [L: WARN] [O: o.a.c.ConnectionState] [I: ] [U: 
>> ] [S: ] [P: platform2] [T: localhost-startStop-1-EventThread] Session 
>> expired event received
>> 2020-02-26 12:39:13.111+ [L: WARN] [O: o.a.c.ConnectionState] [I: ] [U: 
>> ] [S: ] [P: platform2] [T: ForkJoinPool.commonPool-worker-1-EventThread] 
>> Session expired event received
>> 2020-02-26 12:39:24.405+ [L: ERROR] [O: 
>> c.t.s.c.ClusteredModeCoordinator] [I: ] [U: ] [S: ] [P: platform2] [T: 
>> Curator-ConnectionStateManager-0] Could not connect to ZK. Shutting down 
>> server
>> 
>> As you can see in the logs above, the message "ZOOKEEPER STATE CHANGED TO : 
>> LOST" is missing. 
>> 
>> private final void handleConnectionStateChange(final ConnectionState newConnectionState) {
>>     if (ConnectionState.LOST.name().equalsIgnoreCase(newConnectionState.name())) {
>>         _logger.error("Could not connect to ZK. Shutting down server..");
>>         shutDownPlatform(1);
>>     } else if (_isSingletonServer
>>             && ConnectionState.SUSPENDED.name().equalsIgnoreCase(newConnectionState.name())) {
>>         notifyClusterModeCoordinatorListenersOfSingletonRoleLoss();
>>     }
>> }
>> 
>> private final void handleTakeSingletonRole() throws Exception {
>> 
>> _logger.info("handleTakeSingletonRole: Receive

Re: Curator listener issue

2020-02-26 Thread Jordan Zimmerman
You are shutting down your server in the handleConnectionStateChange() handler. 
So, maybe your application shuts down/closes the Curator handle before the other 
handler gets called. i.e. a kind of race.

-Jordan

> On Feb 26, 2020, at 10:15 AM, Arpit Jain  wrote:
> 
> What is RetryLoopExecutor? Curator does all retries internally. You should 
> not have your own retry mechanism.
> It's just a retry loop for the submitted action. Will remove it if it's already 
> done by Curator.
> Calling blockUntilConnectedOrTimedOut() is unnecessary. Curator does this 
> internally already.
> Will remove it.
> What do you do with the "isConnected" value? That seems suspicious to me.
> It's not used anywhere. It's just for logging whether we get connected or not.
> You do not get LOST until the session expires. How long is 
> "coordinatorSessionTimeout"? You won't receive LOST until that has elapsed.
>   "ConnectionTimeout": 1,
>   "Hosts": "localhost:2181",
>   "MaxRetries": 3,
>   "RetryTimeout": 3000,
>   "SessionTimeout": 9,
> As I said earlier, I am receiving LOST on both application instances. It's 
> only that specific listener that is not getting called. Here are the logs: 
> 
> [L: WARN] [O: c.t.s.c.ZookeeperHelper] [I: ] [U: ] [S: ] [P: platform2] [T: 
> Curator-ConnectionStateManager-0] ZOOKEEPER STATE CHANGED TO : SUSPENDED
> 2020-02-26 12:39:13.107+ [L: WARN] [O: o.a.c.f.s.ConnectionStateManager] 
> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] 
> Session timeout has elapsed while SUSPENDED. Injecting a session expiration. 
> Elapsed ms: 40003. Adjusted session timeout ms: 4
> 2020-02-26 12:39:13.109+ [L: WARN] [O: o.a.c.f.s.ConnectionStateManager] 
> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] 
> Session timeout has elapsed while SUSPENDED. Injecting a session expiration. 
> Elapsed ms: 40002. Adjusted session timeout ms: 4
> 2020-02-26 12:39:13.109+ [L: WARN] [O: o.a.c.f.s.ConnectionStateManager] 
> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] 
> Session timeout has elapsed while SUSPENDED. Injecting a session expiration. 
> Elapsed ms: 40002. Adjusted session timeout ms: 4
> 2020-02-26 12:39:13.111+ [L: WARN] [O: o.a.c.ConnectionState] [I: ] [U: ] 
> [S: ] [P: platform2] [T: localhost-startStop-1-EventThread] Session expired 
> event received
> 2020-02-26 12:39:13.111+ [L: WARN] [O: o.a.c.ConnectionState] [I: ] [U: ] 
> [S: ] [P: platform2] [T: localhost-startStop-1-EventThread] Session expired 
> event received
> 2020-02-26 12:39:13.111+ [L: WARN] [O: o.a.c.ConnectionState] [I: ] [U: ] 
> [S: ] [P: platform2] [T: ForkJoinPool.commonPool-worker-1-EventThread] 
> Session expired event received
> 2020-02-26 12:39:24.405+ [L: ERROR] [O: c.t.s.c.ClusteredModeCoordinator] 
> [I: ] [U: ] [S: ] [P: platform2] [T: Curator-ConnectionStateManager-0] Could 
> not connect to ZK. Shutting down server
> 
> As you can see in the logs above, the message "ZOOKEEPER STATE CHANGED TO : LOST" 
> is missing. 
> 
> private final void handleConnectionStateChange(final ConnectionState newConnectionState) {
>     if (ConnectionState.LOST.name().equalsIgnoreCase(newConnectionState.name())) {
>         _logger.error("Could not connect to ZK. Shutting down server..");
>         shutDownPlatform(1);
>     } else if (_isSingletonServer
>             && ConnectionState.SUSPENDED.name().equalsIgnoreCase(newConnectionState.name())) {
>         notifyClusterModeCoordinatorListenersOfSingletonRoleLoss();
>     }
> }
> 
> private final void handleTakeSingletonRole() throws Exception {
>     _logger.info("handleTakeSingletonRole: Received Singleton Role.");
> 
>     // We have just become the Singleton Server. Ha!
>     try {
>         while (true) {
>             try {
>                 Thread.sleep(1000);
>             } catch (Exception e) {
>                 // ignore interrupted exception
>             }
>         }
>     }
> 
> Thanks
> 
> On Wed, Feb 26, 2020 at 2:53 PM Jordan Zimmerman wrote:
> A few things:
> 
> What is RetryLoopExecutor? Curator does all retries internally. You should 
> not have your own retry mechanism.
> Calling blockUntilConnectedOrTimedOut() is unnecessary. Curator does this 
> internally already.
> What do you do with the "isConnected" value? That seems suspicious to me.
> You do not get LOST until the session expires. How long is 
> "coordinatorSessionTimeout"

Re: Curator state change listener

2020-02-26 Thread Jordan Zimmerman
There isn't enough information in your email to give an answer. Please provide a 
test case.

-Jordan

> On Feb 26, 2020, at 7:46 AM, Arpit Jain  wrote:
> 
> 1. I am definitely getting a LOST message on both applications because I am 
> shutting down the application based on the LOST state. This works fine. The 
> only problem I am seeing is that the listener below does not get called every 
> time on either application:
> 
>     curatorFramework.getConnectionStateListenable()
>         .addListener((client, newState) ->
>             LOGGER.warn("ZOOKEEPER STATE CHANGED TO : {}", newState));
> 
> This does not harm us but just wanted to highlight.
> 
> 2. Also on the leader side of my application, I am seeing below message
>[L: ERROR] [O: o.a.c.f.r.l.LeaderSelector] [I: ] [U: ] [S: ] [P: 
> platform1] [T: Curator-LeaderSelector-0] The leader threw an exception
> 
> Thanks
> 
> 
> On Tue, Feb 25, 2020 at 7:43 PM Jordan Zimmerman wrote:
> I'd need a lot more information to understand why. I suggest you set up a 
> test case so we can look at it.
> 
> -Jordan
> 
>> On Feb 25, 2020, at 2:42 PM, Arpit Jain wrote:
>> 
>> There is only 1 ZooKeeper instance. Both Curator logs say SUSPENDED but 
>> only one of them gets LOST.
>> 
>> On Tue, Feb 25, 2020, 6:32 PM Jordan Zimmerman wrote:
>> There's really only 1 ZooKeeper instance? If so, it's not possible that 
>> Curator won't report SUSPENDED and then LOST. If you have more than 1 
>> ZooKeeper instance, how do you know both clients are connecting to the same 
>> instance? Likely they're not.
>> 
>> -Jordan
>> 
>>> On Feb 25, 2020, at 1:27 PM, Arpit Jain wrote:
>>> 
>>> Hi,
>>> 
>>> I have a ZooKeeper node and 2 instances of my application, each using a 
>>> Curator client connected to the same ZooKeeper node. When I kill the 
>>> ZooKeeper node, only one of the Curator instances gets a LOST notification. 
>>> Both applications are able to detect that ZooKeeper is down, but the log 
>>> message in the listener below appears on only one of the application 
>>> instances. Below is the code where I create and register the listener:
>>> 
>>> final CuratorFramework curatorFramework =
>>>     CuratorFrameworkFactory.newClient(coordinatorHosts,
>>>         coordinatorSessionTimeout, coordinatorConnectionTimeout, retryPolicy);
>>> 
>>> curatorFramework.getConnectionStateListenable()
>>>     .addListener((client, newState) ->
>>>         LOGGER.warn("ZOOKEEPER STATE CHANGED TO : {}", newState));
>>> 
>>> Any ideas why this happens?
>>> 
>>> Thanks
>> 
> 



Re: Curator state change listener

2020-02-25 Thread Jordan Zimmerman
I'd need a lot more information to understand why. I suggest you set up a test 
case so we can look at it.

-Jordan

> On Feb 25, 2020, at 2:42 PM, Arpit Jain  wrote:
> 
> There is only 1 ZooKeeper instance. Both Curator logs say SUSPENDED but only 
> one of them gets LOST.
> 
> On Tue, Feb 25, 2020, 6:32 PM Jordan Zimmerman wrote:
> There's really only 1 ZooKeeper instance? If so, it's not possible that 
> Curator won't report SUSPENDED and then LOST. If you have more than 1 
> ZooKeeper instance, how do you know both clients are connecting to the same 
> instance? Likely they're not.
> 
> -Jordan
> 
>> On Feb 25, 2020, at 1:27 PM, Arpit Jain wrote:
>> 
>> Hi,
>> 
>> I have a ZooKeeper node and 2 instances of my application, each using a 
>> Curator client connected to the same ZooKeeper node. When I kill the 
>> ZooKeeper node, only one of the Curator instances gets a LOST notification. 
>> Both applications are able to detect that ZooKeeper is down, but the log 
>> message in the listener below appears on only one of the application 
>> instances. Below is the code where I create and register the listener:
>> 
>> final CuratorFramework curatorFramework =
>>     CuratorFrameworkFactory.newClient(coordinatorHosts,
>>         coordinatorSessionTimeout, coordinatorConnectionTimeout, retryPolicy);
>> 
>> curatorFramework.getConnectionStateListenable()
>>     .addListener((client, newState) ->
>>         LOGGER.warn("ZOOKEEPER STATE CHANGED TO : {}", newState));
>> 
>> Any ideas why this happens?
>> 
>> Thanks
> 



Re: Curator state change listener

2020-02-25 Thread Jordan Zimmerman
There's really only 1 ZooKeeper instance? If so, it's not possible that Curator 
won't report SUSPENDED and then LOST. If you have more than 1 ZooKeeper 
instance, how do you know both clients are connecting to the same instance? 
Likely they're not.

-Jordan

> On Feb 25, 2020, at 1:27 PM, Arpit Jain  wrote:
> 
> Hi,
> 
> I have a ZooKeeper node and 2 instances of my application, each using a 
> Curator client connected to the same ZooKeeper node. When I kill the 
> ZooKeeper node, only one of the Curator instances gets a LOST notification. 
> Both applications are able to detect that ZooKeeper is down, but the log 
> message in the listener below appears on only one of the application 
> instances. Below is the code where I create and register the listener:
> 
> final CuratorFramework curatorFramework =
>     CuratorFrameworkFactory.newClient(coordinatorHosts,
>         coordinatorSessionTimeout, coordinatorConnectionTimeout, retryPolicy);
> 
> curatorFramework.getConnectionStateListenable()
>     .addListener((client, newState) ->
>         LOGGER.warn("ZOOKEEPER STATE CHANGED TO : {}", newState));
> 
> Any ideas why this happens?
> 
> Thanks



Re: ZooKeeper 3.5.7

2020-02-18 Thread Jordan Zimmerman
https://github.com/apache/curator/pull/346

> On Feb 17, 2020, at 11:13 PM, Cameron McKenzie  wrote:
> 
> Yep, agreed.
> 
> On Tue, Feb 18, 2020 at 3:01 PM Jordan Zimmerman wrote:
> Hey Cameron,
> 
> Let's wait for CURATOR-559 <https://issues.apache.org/jira/browse/CURATOR-559> 
> - I think this needs immediate fixing. Very surprising we haven't heard about 
> this before.
> 
> -Jordan
> 
> > On Feb 17, 2020, at 8:29 AM, Jordan Zimmerman wrote:
> > 
> > Sounds good, Cameron. Start the release when you can.
> > 
> > -Jordan
> > 
> >> On Feb 16, 2020, at 4:59 PM, Cameron McKenzie wrote:
> >> 
> >> Sounds like a good idea to me. I can do the release if you like, but I 
> >> might not get a chance until next week.
> >> cheers
> >> 
> >> On Mon, Feb 17, 2020 at 8:50 AM Jordan Zimmerman wrote:
> >> Hi Folks,
> >> 
> >> ZooKeeper 3.5.7 was just released. What do you think about a Curator 
> >> release with this ZK version and all recently merged PRs? Might be a nice 
> >> cap to the Curator 4.x line given that the next version will likely be 
> >> Curator 5.0. Anyone want to run the release? Or I can do it. 
> >> 
> >> 
> >> Jordan Zimmerman
> > 
> 



Re: ZooKeeper 3.5.7

2020-02-17 Thread Jordan Zimmerman
Hey Cameron,

Let's wait for CURATOR-559 <https://issues.apache.org/jira/browse/CURATOR-559> 
- I think this needs immediate fixing. Very surprising we haven't heard about 
this before.

-Jordan

> On Feb 17, 2020, at 8:29 AM, Jordan Zimmerman  
> wrote:
> 
> Sounds good, Cameron. Start the release when you can.
> 
> -Jordan
> 
>> On Feb 16, 2020, at 4:59 PM, Cameron McKenzie wrote:
>> 
>> Sounds like a good idea to me. I can do the release if you like, but I might 
>> not get a chance until next week.
>> cheers
>> 
>> On Mon, Feb 17, 2020 at 8:50 AM Jordan Zimmerman wrote:
>> Hi Folks,
>> 
>> ZooKeeper 3.5.7 was just released. What do you think about a Curator release 
>> with this ZK version and all recently merged PRs? Might be a nice cap to the 
>> Curator 4.x line given that the next version will likely be Curator 5.0. 
>> Anyone want to run the release? Or I can do it. 
>> 
>> 
>> Jordan Zimmerman
> 



Re: ZooKeeper 3.5.7

2020-02-17 Thread Jordan Zimmerman
Sounds good, Cameron. Start the release when you can.

-Jordan

> On Feb 16, 2020, at 4:59 PM, Cameron McKenzie  wrote:
> 
> Sounds like a good idea to me. I can do the release if you like, but I might 
> not get a chance until next week.
> cheers
> 
> On Mon, Feb 17, 2020 at 8:50 AM Jordan Zimmerman wrote:
> Hi Folks,
> 
> ZooKeeper 3.5.7 was just released. What do you think about a Curator release 
> with this ZK version and all recently merged PRs? Might be a nice cap to the 
> Curator 4.x line given that the next version will likely be Curator 5.0. 
> Anyone want to run the release? Or I can do it. 
> 
> 
> Jordan Zimmerman



ZooKeeper 3.5.7

2020-02-16 Thread Jordan Zimmerman
Hi Folks,

ZooKeeper 3.5.7 was just released. What do you think about a Curator release 
with this ZK version and all recently merged PRs? Might be a nice cap to the 
Curator 4.x line given that the next version will likely be Curator 5.0. Anyone 
want to run the release? Or I can do it. 


Jordan Zimmerman

Re: Update ACL of a znode

2020-01-18 Thread Jordan Zimmerman
Yes. Curator supports all of ZooKeeper’s APIs. CuratorFramework has a setACL() 
DSL method. 


Jordan Zimmerman

> On Jan 18, 2020, at 1:57 PM, Arpit Jain  wrote:
> 
> 
> Hi,
> 
> Is it possible to update ACL of an already created znode using curator API?
> 
> Thanks


Re: Curator client for SASL authentication

2020-01-14 Thread Jordan Zimmerman
I saw the conversation on the ZooKeeper list. I’m traveling at the moment. If 
someone else doesn’t get to this, I’ll check when I can. 


Jordan Zimmerman

> On Jan 14, 2020, at 6:05 PM, Arpit Jain  wrote:
> 
> 
> Hi,
> 
> I am using SASL Kerberos based authentication between Zookeeper and Curator. 
> Is below the correct way to create client with SASL authentication ?
> 
> CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder()
>     .connectString(coordinatorHosts)
>     .retryPolicy(retryPolicy)
>     .connectionTimeoutMs(coordinatorConnectionTimeout)
>     .sessionTimeoutMs(coordinatorSessionTimeout);
> 
> final CuratorFramework curatorFramework = builder
>     .authorization("sasl", "zkcli...@example.com".getBytes())
>     .aclProvider(new ACLProvider() {
>         @Override
>         public List<ACL> getDefaultAcl() {
>             return ZooDefs.Ids.CREATOR_ALL_ACL;
>         }
> 
>         @Override
>         public List<ACL> getAclForPath(String path) {
>             return ZooDefs.Ids.CREATOR_ALL_ACL;
>         }
>     })
>     .build();
> curatorFramework.start();
> 
> curatorFramework.create().withMode(CreateMode.CONTAINER).forPath("/MyNode");
> 
> Thanks


ASF Slack

2019-10-23 Thread Jordan Zimmerman
Hey Folks,

Apache has a Slack server. I've created a Curator channel there. Committers, in 
particular, please join. It will make coordination on releases/issues much 
easier.

-Jordan

ASF Slack: https://the-asf.slack.com 

Curator Channel: #curator 




Re: Leader election and leader operation based on zookeeper

2019-10-01 Thread Jordan Zimmerman
Yes, I think this is a hole. As I've thought more about it I think the method 
you described using the lock node in the transaction is actually the best.

-JZ
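
The edge case in the quoted message below can be replayed with an in-memory stand-in for the coordination znode. This is only a sketch of the interleaving, assuming a node whose conditional write succeeds when the expected version matches (a real ZooKeeper setData would throw a bad-version error instead of returning false). VersionedNode and CoordinationNodeRace are hypothetical names, not Curator or ZooKeeper classes.

```java
// In-memory stand-in for a znode with ZooKeeper-style conditional writes.
final class VersionedNode {
    private int version = 0;

    synchronized int getVersion() {
        return version;
    }

    // Like a conditional setData: succeeds only if expectedVersion is current.
    synchronized boolean setData(int expectedVersion) {
        if (expectedVersion != version) {
            return false;
        }
        version++;
        return true;
    }
}

final class CoordinationNodeRace {
    // Replays the interleaving and returns
    // "<instance-2 first write>,<instance-1 write>,<instance-2 later write>".
    static String replay() {
        VersionedNode node = new VersionedNode();

        // 1. instance-1 becomes leader but has not read the version yet.
        // 2. instance-2 becomes leader, reads the version and updates the node.
        int seenBy2 = node.getVersion();
        boolean firstWriteBy2 = node.setData(seenBy2);
        int savedBy2 = seenBy2 + 1;

        // 3. instance-1 (no longer leader) reads the newer version and updates.
        int seenBy1 = node.getVersion();
        boolean writeBy1 = node.setData(seenBy1);

        // instance-2, the real leader, now fails its next guarded operation.
        boolean laterWriteBy2 = node.setData(savedBy2);

        return firstWriteBy2 + "," + writeBy1 + "," + laterWriteBy2;
    }
}
```

The stale instance's write succeeds and the real leader's saved version goes out of date, which is the hole acknowledged above: the version check alone does not prove that the writer is still the leader.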

> On Sep 29, 2019, at 11:41 PM, Zili Chen  wrote:
> 
> Hi Jordan,
> 
> Here is a possible edge case with the coordination node approach.
> 
> When an instance becomes leader it:
> - Gets the version of the coordination ZNode
> - Sets the data for that ZNode (the contents don't matter) using the retrieved 
>   version number
> - If the set succeeds you can be assured you are currently leader (otherwise 
>   release leadership and re-contend)
> - Save the new version
> 
> Actually, becoming leader and getting the version of the coordination znode 
> are NOT atomic. So an edge case is:
> 
> 1. instance-1 becomes leader and tries to get the version of the coordination 
> znode.
> 2. instance-2 becomes leader and updates the coordination znode.
> 3. instance-1 gets the newer version and re-updates the coordination znode.
> 
> Generally speaking, instance-1 has suffered a session expiration, but since 
> Curator retries on session expiration the case above is possible. Although 
> instance-2 will be misled into thinking it is not the leader and will give up 
> leadership, so that the algorithm can proceed, and instance-1 will be 
> asynchronously notified that it is not the leader, instance-1 may already have 
> performed some operations before that notification arrives.
> 
> Curator should ensure, with some synchronization logic, that instance-1 will 
> not regard itself as the leader. Or it could just use a cached leader latch 
> path for checking, because the leader latch path at the moment it becomes 
> leader is synchronized to be the exact one. To be clear, by leader latch path 
> I don't mean the volatile field, but the one cached when it becomes leader.
> 
> Best,
> tison.
> 
> 
> On Sun, Sep 22, 2019 at 2:43 AM, Zili Chen wrote:
> >the Curator recipes delete and recreate their paths
> 
> However, as mentioned above, we do a one-shot election (it doesn't reuse the 
> Curator recipe), so the latch path we check is always the path from the epoch 
> in which the contender became leader. You can check out an implementation of 
> the design here [1]. Even if we want to enable re-contending, we can set a 
> guard
> 
> (change state -> track latch path)
> 
> and check that the state is LEADING && the path exists (so we aren't misled 
> into checking a wrong path).
> 
> Checking the version of a coordination znode sounds like another valid 
> solution. I'd be glad to see it in a future Curator version, and if there is 
> a valid ticket I can help dig into it a bit :-)
> 
> Best,
> tison.
> 
> [1] 
> https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
> 
> On Sun, Sep 22, 2019 at 2:31 AM, Jordan Zimmerman wrote:
> The issue is that the leader path doesn't stay constant. Every time there is 
> a network partition, etc. the Curator recipes delete and recreate their 
> paths. So, I'm concerned that client code trying to keep track of the leader 
> path would be error prone (it's one reason that they aren't public - it's 
> volatile internal state).
> 
> -Jordan
> 
>> On Sep 21, 2019, at 1:26 PM, Zili Chen wrote:
>> 
>> Hi Jordan,
>> 
>> >I think using the leader path may not work
>> 
>> Could you share a situation where this strategy does not work? In our 
>> design, leader contending is one-shot, and when performing a transaction we 
>> check that the latch path exists && the state is LEADING.
>> 
>> Given that the election algorithm works, the state transitions to LEADING 
>> once the contender's latch path becomes the smallest sequential znode. So 
>> the existence of the latch path guarantees that nobody else has become 
>> leader.
>> 
>> 
>> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>> Yeah, Ted - I think this is basically the same thing. We should all try to 
>> poke holes in this.
>> 
>> -JZ
>> 
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> 
>>> 
>>> I would suggest that using an epoch number stored in ZK might be helpful. 
>>> Every operation that the master takes could be made conditional on the 
>>> epoch

Re: Leader election and leader operation based on zookeeper

2019-09-21 Thread Jordan Zimmerman
Thinking more about this... I imagine this works if the current leader path is 
always used. I need to think about this some more. 

-JZ
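
For readers following the thread, the epoch-number idea in the quoted discussion below can be sketched in plain Java without ZooKeeper. This is only an illustration under stated assumptions: EpochFencedStore and its methods are hypothetical names, and the synchronized methods merely stand in for the atomicity that a ZooKeeper multi-transaction with a version check would have to provide.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for "writes gated on an epoch number". */
final class EpochFencedStore {
    private long epoch = 0;
    private final Map<String, String> data = new HashMap<>();

    /** Models winning an election: bumps the epoch and returns the new value. */
    synchronized long becomeLeader() {
        return ++epoch;
    }

    /** Applies the write only if the caller's epoch is still the current one. */
    synchronized boolean writeIfEpoch(long expectedEpoch, String key, String value) {
        if (expectedEpoch != epoch) {
            return false; // a newer leader exists; the stale write is rejected
        }
        data.put(key, value);
        return true;
    }

    synchronized String read(String key) {
        return data.get(key);
    }
}
```

A contender that obtained epoch 1 from becomeLeader() gets false from writeIfEpoch(1, ...) as soon as another contender has called becomeLeader(), which is the property the thread is after.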

> On Sep 21, 2019, at 1:31 PM, Jordan Zimmerman wrote:
> 
> The issue is that the leader path doesn't stay constant. Every time there is 
> a network partition, etc. the Curator recipes delete and recreate their 
> paths. So, I'm concerned that client code trying to keep track of the leader 
> path would be error prone (it's one reason that they aren't public - it's 
> volatile internal state).
> 
> -Jordan
> 
>> On Sep 21, 2019, at 1:26 PM, Zili Chen <wander4...@gmail.com> wrote:
>> 
>> Hi Jordan,
>> 
>> >I think using the leader path may not work
>> 
>> Could you share a situation where this strategy does not work? In our 
>> design, leader contending is one-shot, and when performing a transaction we 
>> check both the existence of the latch path and that the state is LEADING.
>> 
>> Given that the election algorithm works, the state transitions to LEADING 
>> only once the contender's latch path has become the smallest sequential 
>> znode. So the existence of the latch path guarantees that nobody else has 
>> become leader.
>> 
>> 
>> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>> Yeah, Ted - I think this is basically the same thing. We should all try to 
>> poke holes in this.
>> 
>> -JZ
>> 
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> 
>>> 
>>> I would suggest that using an epoch number stored in ZK might be helpful. 
>>> Every operation that the master takes could be made conditional on the 
>>> epoch number using a multi-transaction.
>>> 
>>> Unfortunately, as you say, you have to have the update of the epoch be 
>>> atomic with becoming leader. 
>>> 
>>> The natural way to do this is to have an update of an epoch file be part of 
>>> the leader election, but that probably isn't possible using Curator. The 
>>> way I would tend to do it would be have a persistent file that is updated 
>>> atomically as part of leader election. The version of that persistent file 
>>> could then be used as the epoch number. All updates to files that are gated 
>>> on the epoch number would only proceed if no other master has been elected, 
>>> at least if you use the sync option.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4...@gmail.com> wrote:
>>> Hi ZooKeepers,
>>> 
>>> Recently there is an ongoing refactor[1] in Flink community aimed at
>>> overcoming several inconsistent state issues on ZK we have met. I come
>>> here to share our design of leader election and leader operation. For
>>> leader operation, it is operation that should be committed only if the
>>> contender is the leader. Also CC Curator mailing list because it also
>>> contains the reason why we cannot JUST use Curator.
>>> 
>>> The rule we want to keep is
>>> 
>>> **Writes on ZK must be committed only if the contender is the leader**
>>> 
>>> We represent contender by an individual ZK client. At the moment we use
>>> Curator for leader election so the algorithm is the same as the
>>> optimized version in this page[2].
>>> 
>>> The problem is that this algorithm only takes care of leader election but
>>> is indifferent to subsequent operations. Consider the scenario below:
>>> 
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. sender thread suspended for full gc
>>> 4. contender-1 lost leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from full gc, before it reacts to revoke
>>> leadership event, txn-1 retried and sent to ZK.
>>> 
>>> Without another guard, the txn will succeed on ZK, and thus contender-1
>>> commits a write operation even though it is no longer the leader. This
>>> issue is also documented in this note[3].
>>> 
>>> To overcome this issue, instead of just saying that we're unfortunate,
>>> we drafted two possible solutions.
>>> 
>>> The first is documented here[4]. Briefly, when the contender becomes the
>>> leader, we memorize the latch path at that moment. For subsequent
>>> operations, we use a transaction that first checks the
>>> existence of the latch path. Leadership is only swi

Re: Leader election and leader operation based on zookeeper

2019-09-21 Thread Jordan Zimmerman
The issue is that the leader path doesn't stay constant. Every time there is a 
network partition, etc. the Curator recipes delete and recreate their paths. 
So, I'm concerned that client code trying to keep track of the leader path 
would be error prone (it's one reason that they aren't public - it's volatile 
internal state).

-Jordan

> On Sep 21, 2019, at 1:26 PM, Zili Chen  wrote:
> 
> Hi Jordan,
> 
> >I think using the leader path may not work
> 
> Could you share a situation where this strategy does not work? In our 
> design, leader contending is one-shot, and when performing a transaction we 
> check both the existence of the latch path and that the state is LEADING.
> 
> Given that the election algorithm works, the state transitions to LEADING 
> only once the contender's latch path has become the smallest sequential 
> znode. So the existence of the latch path guarantees that nobody else has 
> become leader.
> 
> 
> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
> Yeah, Ted - I think this is basically the same thing. We should all try to 
> poke holes in this.
> 
> -JZ
> 
>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> 
>> 
>> I would suggest that using an epoch number stored in ZK might be helpful. 
>> Every operation that the master takes could be made conditional on the epoch 
>> number using a multi-transaction.
>> 
>> Unfortunately, as you say, you have to have the update of the epoch be 
>> atomic with becoming leader. 
>> 
>> The natural way to do this is to have an update of an epoch file be part of 
>> the leader election, but that probably isn't possible using Curator. The way 
>> I would tend to do it would be have a persistent file that is updated 
>> atomically as part of leader election. The version of that persistent file 
>> could then be used as the epoch number. All updates to files that are gated 
>> on the epoch number would only proceed if no other master has been elected, 
>> at least if you use the sync option.
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4...@gmail.com> wrote:
>> Hi ZooKeepers,
>> 
>> Recently there is an ongoing refactor[1] in Flink community aimed at
>> overcoming several inconsistent state issues on ZK we have met. I come
>> here to share our design of leader election and leader operation. For
>> leader operation, it is operation that should be committed only if the
>> contender is the leader. Also CC Curator mailing list because it also
>> contains the reason why we cannot JUST use Curator.
>> 
>> The rule we want to keep is
>> 
>> **Writes on ZK must be committed only if the contender is the leader**
>> 
>> We represent contender by an individual ZK client. At the moment we use
>> Curator for leader election so the algorithm is the same as the
>> optimized version in this page[2].
>> 
>> The problem is that this algorithm only takes care of leader election but
>> is indifferent to subsequent operations. Consider the scenario below:
>> 
>> 1. contender-1 becomes the leader
>> 2. contender-1 proposes a create txn-1
>> 3. sender thread suspended for full gc
>> 4. contender-1 lost leadership and contender-2 becomes the leader
>> 5. contender-1 recovers from full gc, before it reacts to revoke
>> leadership event, txn-1 retried and sent to ZK.
>> 
>> Without another guard, the txn will succeed on ZK, and thus contender-1
>> commits a write operation even though it is no longer the leader. This
>> issue is also documented in this note[3].
>> 
>> To overcome this issue, instead of just saying that we're unfortunate,
>> we drafted two possible solutions.
>> 
>> The first is documented here[4]. Briefly, when the contender becomes the
>> leader, we memorize the latch path at that moment. Subsequent operations
>> are done in a transaction that first checks the existence of the latch
>> path. Leadership only switches when the latch is gone, and all operations
>> fail once the latch is gone.
>> 
>> The second is still rough. Basically, it relies on the session expiry
>> mechanism in ZK. We will adopt the unoptimized version of the
>> recipe[2], given that in our scenario there are only a few contenders
>> at the same time. Thus we create the /leader node as an ephemeral znode
>> with leader information, and when the session expires we consider
>> leadership revoked and terminate the contender. Asynchronous write
>> operations should not succeed because they will all fail on session expiry.
>> 
>> We cannot 

Re: Leader election and leader operation based on zookeeper

2019-09-20 Thread Jordan Zimmerman
> It seems Curator does not expose session id

You can always access the ZooKeeper handle directly to get the session ID:

CuratorFramework curator = ...
// getZooKeeper() returns the raw ZooKeeper handle and may throw
long sessionId = curator.getZookeeperClient().getZooKeeper().getSessionId();

-JZ
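
The suggestion quoted below ("record the session ID and don't commit any write operations if session ID changes") could look roughly like the following. This is a minimal sketch, not Flink's or Curator's implementation: SessionGuard is a hypothetical name, and the session ID is supplied through a LongSupplier so the example runs without ZooKeeper. In real code the supplier would presumably wrap the handle access shown above. Note that the check and the write are not atomic here, so this narrows the window rather than closing it.

```java
import java.util.function.LongSupplier;

/** Hypothetical guard that remembers the session ID seen at election time. */
final class SessionGuard {
    private final LongSupplier currentSessionId;
    private final long electedSessionId;

    SessionGuard(LongSupplier currentSessionId) {
        this.currentSessionId = currentSessionId;
        // Capture the session that was live when leadership was obtained.
        this.electedSessionId = currentSessionId.getAsLong();
    }

    /** True only while the live session is the one that won the election. */
    boolean stillSameSession() {
        return currentSessionId.getAsLong() == electedSessionId;
    }

    /** Runs the write only if the session is unchanged; returns whether it ran. */
    boolean writeIfSameSession(Runnable write) {
        if (!stillSameSession()) {
            return false;
        }
        write.run();
        return true;
    }
}
```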

> On Sep 20, 2019, at 10:21 PM, Zili Chen  wrote:
> 
> >>I am assuming the "write operation" here is write to ZooKeeper
> 
> Yes.
> 
> >>Looks like contender-1 was not reusing the same ZooKeeper client object, so 
> >>this explains how the operation that was supposed to fail succeeded?
> 
> Yes. Our communication with ZK is based on Curator, which will 
> re-instantiate a client and retry the operation. Because of asynchronous 
> scheduling, the erroneous execution order is possible.
> 
> >>record the session ID and don't commit any write operations if session ID 
> >>changes.
> 
> Sounds reasonable. Currently in our ongoing design we treat the latch path 
> as the "session id", so we use a multi-op to atomically verify it.
> It seems Curator does not expose session id. In my option 2 above I am even 
> thinking of falling back to plain ZooKeeper, so that we just fail on
> session expiry and instantiate another contender to contend for leadership. 
> This will save us from maintaining mutable state during a
> leadership epoch (to be clear, Flink-scoped leadership, not ZK).
> 
> Best,
> tison.
> 
> 
> On Sat, Sep 21, 2019 at 4:03 AM, Michael Han <h...@apache.org> wrote:
> >> thus contender-1 commit a write operation even if it is no longer the 
> >> leader
> 
> I am assuming the "write operation" here is write to ZooKeeper (as opposed to 
> write to an external storage system)? If so:
> 
> >> contender-1 recovers from full gc, before it reacts to revoke leadership 
> >> event, txn-1 retried and sent to ZK.
> 
> contender-2 becoming the leader implies that the ephemeral node belonging to 
> contender-1 has been removed, which further implies that the session 
> belonging to contender-1 is either explicitly closed (by the client) or 
> expired. So if contender-1 was still using the same ZooKeeper client object, 
> then it's not possible for txn-1 to succeed, as the session expiry is an 
> event ordered prior to txn-1, which wouldn't commit after an expired session.
> 
> >> Curator always creates a new client on session expire and retry the 
> >> operation.
> Looks like contender-1 was not reusing the same ZooKeeper client object, so 
> this explains how the operation that was supposed to fail succeeded?
> 
> If my reasoning makes sense, one idea might be, on the Flink side, once you 
> finish leader election with ZK, to record the session ID and not commit any 
> write operations if the session ID changes.
> 
> The fencing token + multi might also work, but that sounds a little bit 
> heavier. 
> 
> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen wrote:
> Hi ZooKeepers,
> 
> Recently there is an ongoing refactor[1] in Flink community aimed at
> overcoming several inconsistent state issues on ZK we have met. I come
> here to share our design of leader election and leader operation. For
> leader operation, it is operation that should be committed only if the
> contender is the leader. Also CC Curator mailing list because it also
> contains the reason why we cannot JUST use Curator.
> 
> The rule we want to keep is
> 
> **Writes on ZK must be committed only if the contender is the leader**
> 
> We represent contender by an individual ZK client. At the moment we use
> Curator for leader election so the algorithm is the same as the
> optimized version in this page[2].
> 
> The problem is that this algorithm only takes care of leader election but
> is indifferent to subsequent operations. Consider the scenario below:
> 
> 1. contender-1 becomes the leader
> 2. contender-1 proposes a create txn-1
> 3. sender thread suspended for full gc
> 4. contender-1 lost leadership and contender-2 becomes the leader
> 5. contender-1 recovers from full gc, before it reacts to revoke
> leadership event, txn-1 retried and sent to ZK.
> 
> Without another guard, the txn will succeed on ZK, and thus contender-1
> commits a write operation even though it is no longer the leader. This
> issue is also documented in this note[3].
> 
> To overcome this issue, instead of just saying that we're unfortunate,
> we drafted two possible solutions.
> 
> The first is documented here[4]. Briefly, when the contender becomes the
> leader, we memorize the latch path at that moment. Subsequent operations
> are done in a transaction that first checks the existence of the latch
> path. Leadership only switches when the latch is gone, and all operations
> fail once the latch is gone.
> 
> The second is still rough. Basically, it relies on the session expiry
> mechanism in ZK. We will adopt the unoptimized version of the
> recipe[2], given that in our scenario there are only a few contenders
> at the same time. Thus we create the /leader node as an ephemeral znode
> with leader information, and when the session expires we consider
> leadership revoked and terminate the contender. Asynchronous write operations
> should not succeed 

Re: Enable configuration to replace usage of RetryLoop with SessionFailRetryLoop

2019-09-20 Thread Jordan Zimmerman
Yes, I think so. It's been a very long time since I've looked at 
SessionFailRetryLoop and it was written by someone other than me. But, the 
general answer is this: 

The ConnectionHandlingPolicy controls how the retry loop handling is done. See 
StandardConnectionHandlingPolicy.java. You can set a different 
ConnectionHandlingPolicy in the CuratorFrameworkFactory when creating the 
Curator instance. Write a custom handling policy that implements 
callWithRetry() however you need to.

-Jordan

> On Sep 19, 2019, at 9:33 AM, Zili Chen  wrote:
> 
> Hi Curators,
> 
> I notice we have a SessionFailRetryLoop which fails all operations if
> session expired. Is it possible we have an option to use
> SessionFailRetryLoop for all internal retry-able operations?
> 
> Best,
> tison.



Re: Concurrent createContainers

2019-09-13 Thread Jordan Zimmerman
Yeah - that would be something else. As you correctly note, ZKPaths catches 
KeeperException.NodeExistsException internally.
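
The "tolerate node-exists while creating parents" behaviour discussed in this thread can be illustrated without ZooKeeper. The sketch below is not the ZKPaths code; InMemoryTree and its nested NodeExistsException are hypothetical stand-ins, with a concurrent set playing the role of the znode tree.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory tree used to illustrate mkdir -p style creation. */
final class InMemoryTree {
    /** Stand-in for KeeperException.NodeExistsException. */
    static final class NodeExistsException extends Exception {
        NodeExistsException(String path) {
            super(path);
        }
    }

    private final Set<String> nodes = ConcurrentHashMap.newKeySet();

    /** Creates a single node; fails if it is already present. */
    void create(String path) throws NodeExistsException {
        if (!nodes.add(path)) {
            throw new NodeExistsException(path);
        }
    }

    /** Creates every component of the path, ignoring ones that already exist. */
    void mkdirs(String path) {
        StringBuilder current = new StringBuilder();
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty segment before the leading slash
            }
            current.append('/').append(part);
            try {
                create(current.toString());
            } catch (NodeExistsException alreadyThere) {
                // Another caller created it first; that is the desired end state.
            }
        }
    }

    boolean exists(String path) {
        return nodes.contains(path);
    }
}
```

Two threads calling mkdirs("/a/b/c/d") and mkdirs("/a/b/c/e") may race on the shared parents, but because the already-exists case is swallowed, both calls complete and all five nodes end up present.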

> On Sep 11, 2019, at 9:48 PM, Zili Chen  wrote:
> 
> I notice that methods mentioned above internally call ZKPaths.mkdirs which
> tolerates KeeperException.NodeExistsException. So it is not a problem.
> 
> Correct me if I am wrong.
> 
> Best,
> tison.
> 
> 
> On Thu, Sep 12, 2019 at 7:57 AM, Zili Chen <wander4...@gmail.com> wrote:
> Hi Curators,
> 
> If there are concurrent createContainers or creatingParentContainersIfNeeded 
> calls, is there a race condition in which one of
> them fails with KeeperException.NodeExistsException?
> 
> Say thread-1 calls createContainers("/a/b/c/d") while thread-2 calls
> createContainers("/a/b/c/e"); I would expect it to work like
> 
> $ mkdir -p a/b/c/d
> $ mkdir -p a/b/c/e
> 
> with all directories made and no exception thrown.
> 
> Best,
> tison.



Re: LeaderLatch recipe could run into two leaders

2019-08-14 Thread Jordan Zimmerman
I did a quick read. Do I understand correctly that adding a method to return 
the LeaderLatch's latchPath will solve the issue for you?

-Jordan

> On Aug 14, 2019, at 5:49 PM, Zili Chen  wrote:
> 
> Reactivating this thread following the discussion on the ZooKeeper list
> to try to confirm my understanding of this leader election topic.
> 
> The situation in which we run into trouble is described above, and
> the solution we found is documented here[1]. See also point 1 in the
> "Motivation" section and the first paragraph of the "Proposal Design"
> section.
> 
> Best,
> tison.
> 
> [1] 
> https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
> 
> On Thu, Mar 28, 2019 at 8:51 AM, Zili Chen <wander4...@gmail.com> wrote:
> Hi Cameron & Jordan,
> 
> Thanks for your replies! I think our case is similar to the one described in 
> the Tech Note.
> 
> Briefly, in our situation:
> 
> 1. contender-1 was elected as the leader and was serving.
> 2-1. contender-1 lost its connection to ZooKeeper and thus the ephemeral node 
> was deleted.
> Contender-1#notLeader sent a message to contender-1 telling it that it
> was no longer the leader, but before this message was processed, messages
> causing actions that only the leader may perform were processed.
> 2-2. Concurrently, contender-2 was elected as the new leader and was serving 
> as the leader.
> 3. Before contender-1#notLeader's message was properly processed, both of 
> the contenders regarded themselves as the leader.
> 
> This is not a rare case in our situation, since we cannot guarantee how many 
> messages will be processed before NotLeader, and quite often we cannot 
> revoke leadership in a timely manner...
> 
> As an optional resolution, we'd like to perform "leader-only actions" in a 
> transaction
> checking the existence of election node path(ourPath), which would be like
> 
> curatorFramework.inTransaction()
>     .check().forPath(electionNodePath).and()
>     .setData().forPath(...).and().commit();
> 
> to overcome this edge case.
> 
> Best,
> tison.
> 
> 
> On Thu, Mar 28, 2019 at 8:29 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
> What Cameron writes is correct. The only other thing to be concerned with is 
> large GC times and small session times. We have a Tech Note on this: 
> https://cwiki.apache.org/confluence/display/CURATOR/TN10
> 
> -Jordan
> 
>> On Mar 27, 2019, at 7:15 PM, Cameron McKenzie <cammcken...@apache.org> wrote:
>> 
>> hey ZiLi,
>> In point 2, the deletion of the path zNode will not occur under normal 
>> conditions, only if some other application removed the zNode. The contract 
>> specified by Curator is that this zNode is under the control of Curator and 
>> the behaviour of the recipe can't be guaranteed if this contract is broken. 
>> 
>> Having said that, it is quite possible at point 2 for contender-1 to just 
>> lose its connection to ZK. As soon as this occurs contender 1 would get a 
>> SUSPENDED event and would revoke leadership locally. 
>> Contender 2 will not gain leadership until the leadership zNode of contender 
>> 1 is removed (this will happen once contender 1's session times out)
>> Once contender-1 reconnects to ZK, it will delete its existing zNode and then 
>> create a new zNode.
>> 
>> So, from the perspective of Curator, there should not be a situation where 
>> there are multiple leaders. You need to ensure that your application 
>> responds to the Curator callbacks of leadership acquisition / revocation in 
>> a timely manner though, otherwise your application may be in a state where 
>> there are multiple leaders.
>> cheers
>> 
>> 
>> 
>> On Thu, Mar 28, 2019 at 10:43 AM Zili Chen <wander4...@gmail.com> wrote:
>> Hi Curators,
>> 
>> Any advice, or it is just not the case?
>> 
>> If the latter, I'd like to learn how to reason that.
>> 
>> Best,
>> tison.
>> 
>> 
>> On Sun, Mar 24, 2019 at 8:54 AM, ZiLi Chen <wander4...@gmail.com> wrote:
>> [1] http://zookeeper.apache.org/doc/r3.4.13/recipes.html#sc_leaderElection
>> On Sun, Mar 24, 2019 at 8:53 AM, ZiLi Chen <wander4...@gmail.com> wrote:
>> Hi Curators,
>> 
>> While using LeaderLatch recipe for leader election, I notice that we use 
>> #isL

Re: [DISCUSS] CURATOR-533

2019-07-28 Thread Jordan Zimmerman
Sure - makes sense, Cameron.

When CURATOR-505 was implemented, each ConnectionStateListener created was 
mapped to a new CircuitBreaker instance. In hindsight, this doesn't make sense. 
Only a single, shared CircuitBreaker is needed. Here's the email that the 
Elastic engineer originally sent me that started this:

"We enabled circuit breaker in production a month ago and everything has 
been working fine so far.

Also, we implemented a retry policy that adds some jitter: we specify 
minimum and maximum allowed interval and every retry it picks a random 
value within the allowed range.

However, while we were testing the feature we found out that it works a 
little bit differently than we expected: it creates a circuit breaker per a 
connection state listener. As long as we use the retry policy with jitter, 
the fact that there are many circuit breakers can potentially introduce 
some races between components."

-Jordan
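
The jitter described in the quoted email (a random interval between a configured minimum and maximum on every retry) can be sketched as below. The Elastic implementation is not shown in this thread, so this is only an assumption of what such a computation might look like; JitterInterval and nextSleepMs are hypothetical names, and wiring the value into a Curator RetryPolicy is left out.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper that picks a random retry interval within fixed bounds. */
final class JitterInterval {
    private JitterInterval() {
    }

    /** Returns a value in [minMs, maxMs], both ends inclusive. */
    static long nextSleepMs(long minMs, long maxMs) {
        if (minMs < 0 || maxMs < minMs || maxMs == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 <= minMs <= maxMs < Long.MAX_VALUE");
        }
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
    }
}
```

With one circuit breaker per listener, each breaker would draw its own random intervals, which is one way to read the "races between components" concern raised above.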

> On Jul 28, 2019, at 7:28 PM, Cameron McKenzie  wrote:
> 
> hey Jordan,
> Can you expand on why the previous implementation was not ideal?
> 
> Given that the decorator approach was only introduced recently, I don't
> imagine it has had a lot of uptake, but it may be worth asking the question
> on the curator-user list to work out if there's going to be any impact in
> removing the existing implementation?
> cheers
> 
> On Mon, Jul 29, 2019 at 2:50 AM Jordan Zimmerman <jor...@jordanzimmerman.com>
> wrote:
> 
>> Hi Committers,
>> 
>> CURATOR-505 introduced circuit breaking behavior via
>> CircuitBreakingConnectionStateListener and
>> ConnectionStateListenerDecorator. Elastic has been using it with success
>> but reports that the implementation can be improved. The existing
>> implementation uses a new CircuitBreaker for each ConnectionStateListener
>> set in a Curator client. It turns out that this is not ideal. Instead, a
>> shared CircuitBreaker should be used per Curator client.
>> 
>> Unfortunately, the best way to do this is to remove the
>> ConnectionStateListenerDecorator semantics and use a different mechanism.
>> This Issue proposes to do this and remove ConnectionStateListenerDecorator.
>> This is a breaking change but given the short amount of time it's been in
>> Curator it's unlikely that it's been widely adopted.
>> If the community considers a breaking change too harsh the older classes
>> can be maintained for a while and marked as @Deprecated. Otherwise we can
>> make the next release 4.3.0 (note: our semantic versioning has always been
>> wrong - different issue) to denote a breaking change.
>> 
>> I have a candidate PR here: https://github.com/apache/curator/pull/320 -
>> the Jira issue is: https://issues.apache.org/jira/browse/CURATOR-533
>> 
>> -Jordan



Re: Improving exception classification for retries

2019-07-22 Thread Jordan Zimmerman
> what would you say about making it fully spring-retry based

I'm -1 on adding a dependency to Spring.

> contributing some improvements about Curator's retry logic

That would be great and very much appreciated.

> What is your policy on breaking changes?

We've always tried to be backward compatible. In this instance, I don't think 
breaking changes are needed. Internally, Curator always uses 
"RetryLoop.callWithRetry" which in turn calls 
"client.getConnectionHandlingPolicy().callWithRetry(client, proc)". So, you can 
already write your own "ConnectionHandlingPolicy" (or extend Curator's 
StandardConnectionHandlingPolicy) and write your own version of "callWithRetry".

Other than that, I don't think it would break anything to make the ctor for 
"RetryLoop" public and make the implementation pluggable (via 
CuratorFrameworkFactory). So, that's also a possibility.

-Jordan
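
As an illustration of the kind of pluggable exception classification discussed in this thread, here is a minimal, library-free sketch of a retry loop whose "is this retriable?" decision is supplied by the caller. It is not Curator's RetryLoop or ConnectionHandlingPolicy; ClassifyingRetry is a hypothetical name, and backoff sleeping is omitted to keep the example short.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical retry helper with a caller-supplied exception classifier. */
final class ClassifyingRetry {
    private final int maxRetries;
    private final Predicate<Throwable> retriable;

    ClassifyingRetry(int maxRetries, Predicate<Throwable> retriable) {
        this.maxRetries = maxRetries;
        this.retriable = retriable;
    }

    /** Calls proc, retrying only while the classifier accepts the failure. */
    <T> T call(Callable<T> proc) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return proc.call();
            } catch (Exception e) {
                if (retries >= maxRetries || !retriable.test(e)) {
                    throw e; // out of retries, or classified as non-retriable
                }
                retries++;
            }
        }
    }
}
```

A classifier such as "retry any IOException except UnknownHostException" would express the DNS case from the quoted request below without touching the loop itself.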

> On Jul 22, 2019, at 6:26 AM, Krajcsovszki, Gergely 
>  wrote:
> 
> Dear Curator devs,
> 
> We are considering contributing some improvements about Curator's retry 
> logic, specifically how exceptions are classified as retriable and 
> non-retriable, and would like to discuss what kind of changes you would be 
> open to.
> 
> Our immediate goal is to make it configurable whether to retry DNS issues or 
> not (ideally to set up a separate RetryPolicy for them), but if we are 
> already touching it I think it would be best to make the list of retriable 
> exceptions entirely configurable, or even to make it into a function of 
> exception to RetryPolicy.
> 
> Since your retry handling is very similar to spring-retry, what would you say 
> about making it fully spring-retry based? (So separately adjustable retry 
> policies which ultimately decide whether to retry a specific Throwable or 
> not, separately adjustable backoff policies, optional recovery callbacks for 
> Throwables that end up not being retried and user callbacks that allow 
> interfering with the retry decisions as well.)
> 
> If not, where should these adjustments go? Can the exception classification 
> move to the RetryPolicy interface from the RetryLoop? Or should it stay in 
> the RetryLoop and be adjustable via static methods?
> 
> What is your policy on breaking changes?
> 
> Thanks,
> 
> Gergely Krajcsovszki
> 
> 
> 



Re: Curator client getting suspended before zookeeper session timeout

2019-05-13 Thread Jordan Zimmerman
"Suspended" is sent as soon as the connection is lost. "Lost" is sent when the 
session timeout expires. This is documented here: 
http://curator.apache.org/errors.html (see Notifications). Also, this Tech Note 
has some details: https://cwiki.apache.org/confluence/display/CURATOR/TN12 as 
well as this one: https://cwiki.apache.org/confluence/display/CURATOR/TN14

-Jordan
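
One way an application might react to these notifications is sketched below, under the assumption that it wants to stop leader-only work as soon as the connection is suspended. LinkState and LeaderGate are hypothetical names that only mirror the states mentioned in this thread; they are not Curator's ConnectionState or ConnectionStateListener.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical mirror of the connection states named in this thread. */
enum LinkState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Hypothetical gate that leader-only code consults before acting. */
final class LeaderGate {
    private final AtomicBoolean mayActAsLeader = new AtomicBoolean(false);

    /** Called when this process wins an election. */
    void onElected() {
        mayActAsLeader.set(true);
    }

    /** Closes the gate on SUSPENDED or LOST; other states leave it unchanged. */
    void onStateChanged(LinkState state) {
        if (state == LinkState.SUSPENDED || state == LinkState.LOST) {
            mayActAsLeader.set(false);
        }
        // RECONNECTED does not reopen the gate; only a new election does.
    }

    boolean mayActAsLeader() {
        return mayActAsLeader.get();
    }
}
```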

> On May 13, 2019, at 1:45 AM, Göktürk Gezer  wrote:
> 
> Hello,
> 
> While using LeaderSelector recipe, I'm having an issue where the curator 
> session is getting suspended after not receiving a ping response for a period 
> of time that is less than the configured zookeeper session timeout.
> 
> I'm using Curator 4.0.1, and Zookeeper 3.4.9.
> Configured zookeeper session timeout is 60 seconds.
> 
> In our test cluster, I'm observing a steady ping between client and server 
> which usually takes 13 seconds. However while under high CPU utilization, 
> ping doesn't reach server and that causes immediate session suspension, which 
> results in a change in leader. So the session is getting suspended ~30 
> seconds after the last successful ping.
> 
> My questions are:
> Is it expected for Curator to suspend the connection before the ZooKeeper 
> session timeout?
> What role does the Curator framework retry policy play in a standard 
> heartbeat? (I use ExponentialBackoff with 3 retries)
> 
> Regards,
> Gokturk



Re: LeaderLatch recipe could run into two leaders

2019-03-27 Thread Jordan Zimmerman
What Cameron writes is correct. The only other thing to be concerned with is 
large GC times and small session times. We have a Tech Note on this: 
https://cwiki.apache.org/confluence/display/CURATOR/TN10 


-Jordan

> On Mar 27, 2019, at 7:15 PM, Cameron McKenzie  wrote:
> 
> hey ZiLi,
> In point 2, the deletion of the path zNode will not occur under normal 
> conditions, only if some other application removed the zNode. The contract 
> specified by Curator is that this zNode is under the control of Curator and 
> the behaviour of the recipe can't be guaranteed if this contract is broken. 
> 
> Having said that, it is quite possible at point 2 for contender-1 to just 
> lose its connection to ZK. As soon as this occurs contender 1 would get a 
> SUSPENDED event and would revoke leadership locally. 
> Contender 2 will not gain leadership until the leadership zNode of contender 
> 1 is removed (this will happen once contender 1's session times out)
> Once contender-1 reconnects to ZK, it will delete its existing zNode and then 
> create a new zNode.
> 
> So, from the perspective of Curator, there should not be a situation where 
> there are multiple leaders. You need to ensure that your application responds 
> to the Curator callbacks of leadership acquisition / revocation in a timely 
> manner though, otherwise your application may be in a state where there are 
> multiple leaders.
> cheers
> 
> 
> 
> On Thu, Mar 28, 2019 at 10:43 AM Zili Chen wrote:
> Hi Curators,
> 
> Any advice, or it is just not the case?
> 
> If the latter, I'd like to learn how to reason that.
> 
> Best,
> tison.
> 
> 
> On Sun, Mar 24, 2019 at 8:54 AM, ZiLi Chen <wander4...@gmail.com> wrote:
> [1] http://zookeeper.apache.org/doc/r3.4.13/recipes.html#sc_leaderElection 
> 
> On Sun, Mar 24, 2019 at 8:53 AM, ZiLi Chen <wander4...@gmail.com> wrote:
> Hi Curators,
> 
> While using the LeaderLatch recipe for leader election, I notice that we use 
> #isLeader and #notLeader as callback functions when a contender 
> becomes or stops being the leader. The implementation of leader election 
> follows the description in the ZooKeeper recipes[1].
> 
> However, there is latency between the election node being deleted 
> (leadership lost) and the LeaderLatch, and then the Listener, getting 
> notified. It is possible that
> 
> 1. contender-1 gained leadership
> 2. ourPath of contender-1 got deleted/lost, and contender-2 gained leadership
> 3. contender-1 was notified that it was no longer the leader.
> 
> In the interval between 2 and 3, both contender-1 and contender-2 would 
> think that they were the leader and perform actions. Is this a known 
> issue that we should tolerate, or is there any approach by which we can 
> ensure that actions (e.g., writing to ZooKeeper) can only be committed if 
> the proposing contender is the only current leader?
> 
> Best,
> tison.



[ANNOUNCE] Apache Curator 4.2.0 released

2019-03-06 Thread Jordan Zimmerman
Hello,

The Apache Curator team is pleased to announce the release of version
4.2.0. Apache Curator is a Java/JVM client library for Apache
ZooKeeper[1], a distributed coordination service. Apache Curator includes a
high-level API framework and utilities to make using Apache ZooKeeper much
easier and more reliable. It also includes recipes for common use cases and
extensions such as service discovery and a Java 8 asynchronous DSL. For
more details, please visit the project website: http://curator.apache.org/

The download page for Apache Curator is here:
https://cwiki.apache.org/confluence/display/CURATOR/Releases

The binary artifacts for Curator are available from Maven Central and its
mirrors.

For general information on Apache Curator, please visit the project website:
http://curator.apache.org

Release Notes - Apache Curator - Version 4.2.0

** Bug
* [CURATOR-481] - Remove jackson-mapper-asl-version and update to
latest version of jackson
* [CURATOR-498] - Protected Mode creation can mistake closing session's
node causing problems for many recipes such as LeaderLatch
* [CURATOR-509] - Incompatible with Java 11

** New Feature
* [CURATOR-505] - A circuit breaking ConnectionStateListener would be
very helpful

** Improvement
* [CURATOR-501] - TestRemoveWatches has expectedExceptions list with ZK
3.5 only class
* [CURATOR-503] - Update dependencies in January 2019

** Task
* [CURATOR-500] - Apache is requiring us to move to Gitbox

Regards,

The Curator Team

[1] Apache ZooKeeper https://zookeeper.apache.org/


Re: LeaderLatch random selection

2019-02-27 Thread Jordan Zimmerman
> the order in which the participants contend for the lock

The order that processes contend is essentially random. Whichever client 
connects first gets the next ephemeral node. It's not deterministic.

-JZ

> On Feb 27, 2019, at 2:55 PM, Patrick Peralta  wrote:
> 
> Hello!
> 
> I have a question about the JavaDoc for LeaderLatch:
> 
> https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/leader/LeaderLatch.html
> 
> It states "If a group of N thread/processes contend for leadership one
> will randomly be assigned leader until it releases leadership at which
> time another one from the group will randomly be chosen". However,
> when reading the source it appears to me that leadership is actually
> assigned in the order in which the participants contend for the lock
> since it uses ephemeral sequential nodes and sorts them accordingly.
> Am I misunderstanding the code or the meaning of the word "randomly"
> in the JavaDoc?
> 
> Thanks!
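
Both statements can hold at once: given the sequence numbers, the choice of leader is deterministic, but which client obtains which sequence number depends on connection timing. A minimal sketch of the ordering step, in plain Java with hypothetical node names (the exact naming scheme Curator uses is not shown in this thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Models how a leader is picked from ephemeral sequential node names. */
final class LatchOrder {
    /** Parses the numeric sequence suffix that follows the last '-' in the name. */
    static long sequenceOf(String nodeName) {
        return Long.parseLong(nodeName.substring(nodeName.lastIndexOf('-') + 1));
    }

    /** Returns the participants ordered by sequence number; the first is the leader. */
    static List<String> sortedParticipants(List<String> nodeNames) {
        List<String> sorted = new ArrayList<>(nodeNames);
        sorted.sort(Comparator.comparingLong(LatchOrder::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical names; the suffix reflects the order in which the server created the nodes.
        List<String> nodes = Arrays.asList(
            "b-latch-0000000012", "a-latch-0000000007", "c-latch-0000000009");
        System.out.println(sortedParticipants(nodes).get(0)); // a-latch-0000000007
    }
}
```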



[DISCUSS] Release 4.2.0

2019-02-13 Thread Jordan Zimmerman
Hey,

CURATOR-498 is a critical problem and needs to be released ASAP. So, I'd like 
to prep a release of 4.2.0. There's not much else in this release (CURATOR-505 
is new). If anyone wants anything else in 4.2.0 speak up now and act on the 
corresponding open PR. 

-Jordan

IMPORTANT CURATOR-498

2019-01-01 Thread Jordan Zimmerman
A very serious edge case has been discovered and is described in 
http://issues.apache.org/jira/browse/CURATOR-498. There is a PR for it now: 
https://github.com/apache/curator/pull/299. The fix in the PR makes a few 
subtle changes which shouldn't affect current Curator clients but I'd feel 
better about it if users can review and test with their applications. So, this 
is a call to please test branch CURATOR-498 if you can. Also, we'd appreciate 
as many eyes as possible on the PR.

Thank you.

-Jordan

Re: Release 4.1.0 planning

2018-12-07 Thread Jordan Zimmerman
I'll take a look and see if I can resolve it. 

-JZ

> On Dec 7, 2018, at 2:23 PM, Borja Bravo  wrote:
> 
> Hi,
> 
> A long time ago I reported this bug with a pull request.
> https://issues.apache.org/jira/browse/CURATOR-444
> 
> The fix is just one line but was rejected for not having tests. The second 
> time I added some tests, but they were slow and not very deterministic, and 
> they were rejected too.
> It would be great if we could have 4.1.0 with this recipe fixed. I tried to 
> understand the internals of Curator to improve the tests but gave up after 
> some time. Currently we patch it locally on our systems. 
> 
> Anyway, thanks for the work; it is nice to see that Curator is alive and 
> healthy.
> 
> Regards,
> 
> Borja
> 
> On Thu, Dec 6, 2018 at 7:02 PM Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
> Hello Folks,
> 
> It's time to do a Curator release. The next release will be 4.1.0 and will 
> contain these changes:
> 
>   
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12314425=12342759
> 
> Please speak now if there is an outstanding issue that you'd like to see make 
> it in. There are a bunch of PRs that could use a lot more eyes, test, etc.
> 
> -Jordan
> 
> 
> -- 
> Borja Bravo
> Software Developer
> 
> 
> 
> borja.br...@darwinex.com <mailto:borja.br...@darwinex.com>
> 



Re: Possible deadlock in blockUntilConnected?

2018-12-07 Thread Jordan Zimmerman
A few things to try:

* Integer.MAX_VALUE is not very useful for the session/connection timeouts. Try 
  reasonable numbers.
* The call to client.blockUntilConnected() isn't needed; remove it.

Other than that, I'd need to see logs to see why ZooKeeper itself isn't 
connecting.

-JZ

> On Dec 7, 2018, at 1:36 PM, Alfredo Gimenez  wrote:
> 
> Absolutely, I just wanted to give the high-level description in case this was 
> a clear anti-pattern (multiple threads connecting concurrently to ZK on 
> localhost).
> 
> Each thread has its own Curator client right now because of the design of 
> Kafka Connect--Connect "tasks", which run on separate threads, are meant to 
> run independently of each other (no shared state in the VM). I'll see if it's 
> possible to modify them to share a client--if that's necessary, does that 
> mean client initialization is not thread-safe?
> 
> The thread dump of the deadlocked tasks (all have this same dump):
> 
> "pool-1-thread-1" #64 prio=5 os_prio=0 tid=0x2ab5c0062800 nid=0x5b2d in Object.wait() [0x2ab5a520c000]
>    java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.curator.framework.state.ConnectionStateManager.blockUntilConnected(ConnectionStateManager.java:224)
>   - locked <0x2aac5183c9a0> (a org.apache.curator.framework.state.ConnectionStateManager)
>   at org.apache.curator.framework.imps.CuratorFrameworkImpl.blockUntilConnected(CuratorFrameworkImpl.java:272)
>   at org.apache.curator.framework.imps.CuratorFrameworkImpl.blockUntilConnected(CuratorFrameworkImpl.java:278)
>   at gov.llnl.sonar.kafka.connect.offsetmanager.FileOffsetManager.<init>(FileOffsetManager.java:72)
>   at gov.llnl.sonar.kafka.connect.connectors.DirectorySourceTask.start(DirectorySourceTask.java:89)
>   at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:182)
>   at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
>   at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 
> And my Curator code in FileOffsetManager.java, which does little more than 
> create a very persistent client with a simple listener 
> (DirectorySourceTask.java doesn't do any Curator stuff, here it is just 
> creating a FileOffsetManager instance with a provided zookeeper host/port):
> client = CuratorFrameworkFactory.newClient(
>     zooKeeperHost + ":" + zooKeeperPort,
>     Integer.MAX_VALUE,
>     Integer.MAX_VALUE,
>     new RetryForever(1000));
> 
> FileOffsetManager thisRef = this;
> client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (!newState.isConnected()) {
>             log.warn("Thread {}: Curator state changed to {} with contents: {}",
>                 threadID, newState.toString(), thisRef.toString());
>         }
>     }
> });
> client.start();
> client.blockUntilConnected(); // Sometimes we get stuck here...
> 
> And my test code does the same without a listener:
> CuratorFramework client = CuratorFrameworkFactory.newClient(
>     zooKeeperHost + ":" + zooKeeperPort,
>     Integer.MAX_VALUE,
>     Integer.MAX_VALUE,
>     new RetryForever(1000));
> 
> client.start();
> client.blockUntilConnected();
> 
> On Fri, Dec 7, 2018 at 5:21 AM Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
> There isn't much to go on in your description. Please send some sample code, 
> logs, possibly a thread dump. Maybe send your test program. One thing that 
> sticks out is that you say each thread has its own Curator client. Why is 
> that? You only need 1 Curator client per ZK ensemble in a VM.
> 
> -Jordan
> 
>> On Dec 6, 2018, at 4:42 PM, Alfredo Gimenez <alfredo.gime...@gmail.com> wrote:
>> 
>> I ran into what looks like a deadlock in blockUntilConnected and wanted to 
>> give a high-level description in case someone can help me debug the issue. I 
>> can try to make a reproducible example, but for reasons that will be 
>> apparent,

Re: Possible deadlock in blockUntilConnected?

2018-12-07 Thread Jordan Zimmerman
There isn't much to go on in your description. Please send some sample code, 
logs, possibly a thread dump. Maybe send your test program. One thing that 
sticks out is that you say each thread has its own Curator client. Why is that? 
You only need 1 Curator client per ZK ensemble in a VM.

-Jordan
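
Sharing one client per ensemble across threads can be done with a small registry keyed by connect string. The sketch below is plain Java and uses a generic type parameter in place of the real client class; `SharedClients` and its `factory` argument are hypothetical names. In an actual application the factory would presumably build and start a Curator client, which is an assumption and is not shown here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hands out one shared client instance per connect string. */
final class SharedClients<T> {
    private final ConcurrentMap<String, T> clients = new ConcurrentHashMap<>();
    private final Function<String, T> factory;

    SharedClients(Function<String, T> factory) {
        this.factory = factory;
    }

    /** Creates the client on first use; later callers get the same instance. */
    T get(String connectString) {
        return clients.computeIfAbsent(connectString, factory);
    }

    public static void main(String[] args) {
        // A plain Object stands in for a real client in this sketch.
        SharedClients<Object> registry = new SharedClients<>(cs -> new Object());
        Object first = registry.get("localhost:2181");
        Object second = registry.get("localhost:2181");
        System.out.println(first == second); // true
    }
}
```

`computeIfAbsent` on a `ConcurrentHashMap` runs the factory at most once per key, so concurrent task threads asking for the same connect string end up with the same instance.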

> On Dec 6, 2018, at 4:42 PM, Alfredo Gimenez  wrote:
> 
> I ran into what looks like a deadlock in blockUntilConnected and wanted to 
> give a high-level description in case someone can help me debug the issue. I 
> can try to make a reproducible example, but for reasons that will be 
> apparent, that's not straightforward.
> 
> I am using Curator within a custom Kafka Connect source. As a result, I have 
> a process per node on 11 nodes, and up to 12 tasks (threads) per node, each 
> with its own Curator client. Every node is also running zookeeper, so I 
> initialize the Curator clients by pointing to localhost:2181. On 9 nodes, 
> everything works perfectly, but on the other 2, all tasks seem to hang at 
> blockUntilConnected (specifically here: 
> https://github.com/apache/curator/blob/ae309a29643afc6df511d1d9a162526ce474598b/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L224).
> I found this by observing no activity in my Kafka Connect logs and grabbing 
> a stacktrace via jstack on the offending nodes.
> 
> I also made a small test program that just initializes a client and runs 
> blockUntilConnected (nothing else) and ran it at the same time, and it also 
> hangs there forever. Meanwhile, I can use zookeeper-shell on localhost just 
> fine, and if I initialize a Curator client pointing to one of the other nodes 
> (not localhost) the Curator client initializes fine. 
> 
> Is this a possible deadlock from initializing Curator clients across multiple 
> threads concurrently?



Re: typed model with multiple classes

2018-09-07 Thread Jordan Zimmerman
No, not today. You would create two models, one for the /groups/{groupId} model 
and one for the user.

-JZ

> On Sep 7, 2018, at 12:31 PM, Hendrik Haddorp  wrote:
> 
> Hi,
> 
> is it possible to create a typed model that contains hierarchical objects of 
> different types? For example if I have user and group objects and want to 
> store them like this:
> /groups/{groupId}/users/{userId}
> 
> Using TypedModeledFramework I could set a type for the parameters in the 
> ZPath but the objects returned from the model are always the same. I would 
> like to get Group objects on the group level and User objects on the user 
> level. The objects should only contain data from their level.
> 
> I got this working by having two typed models and then only load data on the 
> correct position. This is however a bit error prone and if I would use a 
> cached model I would end up having all nodes in both caches. Thus I would be 
> interested in a better solution.
> 
> thanks,
> Hendrik



Re: org.apache.curator.x.async.modeled.details.CachedModeledFrameworkImpl.children()

2018-09-07 Thread Jordan Zimmerman
Wow - I think this is a bug. If you don't mind, please open an issue in 
Curator's Jira

> On Sep 7, 2018, at 8:02 AM, Hendrik Haddorp  wrote:
> 
> Hi,
> 
> org.apache.curator.x.async.modeled.details.CachedModeledFrameworkImpl.children()
>  and 
> org.apache.curator.x.async.modeled.details.CachedModeledFrameworkImpl.childrenAsZNodes()
>  do not seem to work. This filter condition looks wrong to me:
> 
> .filter(path -> path.equals(cache.basePath()))
> 
> Getting the children on an uncached model works just fine but on a cached 
> model I always get an empty list. The list that 
> cache.currentChildren(client.modelSpec().path()) returns within the methods 
> looks correct but then there is this strange additional path filtering that 
> throws away everything.
> 
> This seems to be the test code for that class. I don't fully understand the 
> tests but it looks like the children calls are not tested.
> 
> Shall I open a defect or is this enough?
> 
> regards,
> Hendrik
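
The reported symptom can be reproduced with plain strings: a child path is never equal to its base path, so a filter that keeps only paths equal to the base discards every child. The sketch below models paths as strings and contrasts that condition with one that compares the parent of each path. `ChildFilter` and both method names are hypothetical, and whether the second condition matches the fix the project adopted is not stated in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Shows why comparing each child path to the base path yields an empty list. */
final class ChildFilter {
    /** Mirrors the reported condition: keep paths equal to the base path. */
    static List<String> equalsBase(List<String> paths, String base) {
        return paths.stream()
            .filter(p -> p.equals(base))
            .collect(Collectors.toList());
    }

    /** Alternative condition: keep paths whose parent is the base path. */
    static List<String> parentIsBase(List<String> paths, String base) {
        return paths.stream()
            .filter(p -> p.lastIndexOf('/') > 0
                && p.substring(0, p.lastIndexOf('/')).equals(base))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("/groups/a", "/groups/b");
        System.out.println(equalsBase(children, "/groups"));   // []
        System.out.println(parentIsBase(children, "/groups")); // [/groups/a, /groups/b]
    }
}
```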



Re: Rationale for CuratorFramework default value

2018-08-29 Thread Jordan Zimmerman
> it seems I'm becoming the Curator guy ;-)

Sorry to hear it :D That said, we need more committers. Let me know if you're 
interested.

-JZ

> On Aug 29, 2018, at 10:53 AM, Sylvain Wallez  wrote:
> 
> Thanks Jordan!
> 
> Yep, all good at Elastic where it seems I'm becoming the Curator guy ;-)
> 
> I've found a few other places where some defensive constructs could be 
> avoided or things could be optimized, I'll prepare a PR with these soon.
> 
> Sylvain
> 
> Le 29/08/2018 à 17:45, Jordan Zimmerman a écrit :
>> Hi Sylvain :D - I hope all is well at Elastic
>> 
>>> I just found out that the default value for 
>>> `CuratorFrameworkImpl.getDefaultData()` is the string representation of the 
>>> local IP address [1].
>> As I recall, it was done as a debugging aid. Putting the IP address allows 
>> someone to know which client created the ZNode.
>> 
>>> Is there anything fundamentally wrong with having no data on nodes?
>> Nothing wrong with it. It's perfectly fine.
>> 
>>> (and BTW the defensive copy in [2] could be avoided, since 
>>> `Builder.defaultData` already does it in [3])
>> Pull Requests are appreciated :D
>> 
>> -Jordan
>> 
>>> On Aug 29, 2018, at 10:39 AM, Sylvain Wallez  wrote:
>>> 
>>> Hi all, first email on this list!
>>> 
>>> I just found out that the default value for 
>>> `CuratorFrameworkImpl.getDefaultData()` is the string representation of the 
>>> local IP address [1].
>>> 
>>> I just spent some time figuring out where these surprising "ghost values" 
>>> were coming from when creating parent nodes in ZK, so I'm curious about why 
>>> this was chosen rather than an empty byte array, or even null?
>>> 
>>> Also, if we call `CuratorFrameworkFactory.builder().defaultData(null)` then 
>>> the actual default data is byte[0] (see [2]). Is there anything 
>>> fundamentally wrong with having no data on nodes? Using 
>>> `forPath("some-path", null)` seems to work just fine.
>>> 
>>> (and BTW the defensive copy in [2] could be avoided, since 
>>> `Builder.defaultData` already does it in [3])
>>> 
>>> Thanks for any explanation on the rationale behind this,
>>> Sylvain
>>> 
>>> [1] 
>>> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/CuratorFrameworkFactory.java#L141
>>> [2] 
>>> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CuratorFrameworkImpl.java#L158
>>> [3] 
>>> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/CuratorFrameworkFactory.java#L255
>>> 
> 



Re: Rationale for CuratorFramework default value

2018-08-29 Thread Jordan Zimmerman
Hi Sylvain :D - I hope all is well at Elastic

> I just found out that the default value for 
> `CuratorFrameworkImpl.getDefaultData()` is the string representation of the 
> local IP address [1].

As I recall, it was done as a debugging aid. Putting the IP address allows 
someone to know which client created the ZNode. 

> Is there anything fundamentally wrong with having no data on nodes?

Nothing wrong with it. It's perfectly fine.

> (and BTW the defensive copy in [2] could be avoided, since 
> `Builder.defaultData` already does it in [3])

Pull Requests are appreciated :D

-Jordan

> On Aug 29, 2018, at 10:39 AM, Sylvain Wallez  wrote:
> 
> Hi all, first email on this list!
> 
> I just found out that the default value for 
> `CuratorFrameworkImpl.getDefaultData()` is the string representation of the 
> local IP address [1].
> 
> I just spent some time figuring out where these surprising "ghost values" were 
> coming from when creating parent nodes in ZK, so I'm curious about why this 
> was chosen rather than an empty byte array, or even null?
> 
> Also, if we call `CuratorFrameworkFactory.builder().defaultData(null)` then 
> the actual default data is byte[0] (see [2]). Is there anything fundamentally 
> wrong with having no data on nodes? Using `forPath("some-path", null)` seems 
> to work just fine.
> 
> (and BTW the defensive copy in [2] could be avoided, since 
> `Builder.defaultData` already does it in [3])
> 
> Thanks for any explanation on the rationale behind this,
> Sylvain
> 
> [1] 
> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/CuratorFrameworkFactory.java#L141
> [2] 
> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CuratorFrameworkImpl.java#L158
> [3] 
> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/CuratorFrameworkFactory.java#L255
> 



Re: NodeCacheListener Fails to Get Called

2018-07-16 Thread Jordan Zimmerman
There's no issue with NodeCache that I know of. I would think that if there was 
we'd have heard it by now. But, who knows. If you can provide a sample that 
shows the problem it would help.

-Jordan

> On Jul 16, 2018, at 9:07 PM, ? ??  wrote:
> 
> Jordan, thanks for your suggestion.
> 
> My question is whether it is possible for a NodeCacheListener to miss a node 
> change event. If so, when and how does it happen? I am using it in Storm's 
> Bolts. Each Bolt task has a NodeCacheListener and all the NodeCacheListeners 
> watch the same ZNode. But I observed that only some NodeCacheListeners 
> were invoked while others were not.
> 
> wangchunc...@outlook.com <mailto:wangchunc...@outlook.com>
>  
> From: Jordan Zimmerman <mailto:jor...@jordanzimmerman.com>
> Date: 2018-07-17 03:06
> To: user <mailto:user@curator.apache.org>
> Subject: Re: NodeCacheListener Fails to Get Called
> You haven't given any information at all. Please provide a complete question 
> and maybe someone can help.
> 
> -Jordan
> 
>> On Jul 16, 2018, at 3:57 AM, ? ?? <wangchunc...@outlook.com> wrote:
>> 
>> Hi all,
>> 
>> I am using NodeCacheListener in Storm. A new NodeCache instance is created 
>> and started in each Bolt, and a NodeCacheListener is registered with the 
>> NodeCache instance. But I observed that some NodeCacheListeners did not 
>> get invoked. Has any of you ever encountered this issue? Any help is 
>> appreciated. Thanks in advance.
>> 
>> wangchunc...@outlook.com <mailto:wangchunc...@outlook.com>


[DISCUSS] Moving to Java 8

2018-06-24 Thread Jordan Zimmerman
Hey Folks,

Any objections to making Java 8 the minimum for Curator? ZooKeeper is now Java 8 
minimum. I've opened an issue to do this: 
https://issues.apache.org/jira/browse/CURATOR-471 

If I don't hear any objections by the end of the week I'll commit this.

-Jordan

Re: InterProcessMutex error handling

2018-06-21 Thread Jordan Zimmerman
TBH - I don't know why we never created a utility to automate this. We should 
add a "LockConnectionStateListener" or something that listens for connection 
state changes and tries to interrupt the locked thread when there's an issue. 
Until we have that though, you'd have to do this manually. We have 
"LeaderSelectorListenerAdapter" for LeaderSelector. You can look at that as an 
example of how to handle things.

-Jordan
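
Until such a utility exists, the manual handling could look roughly like the sketch below. It is a plain-Java model, not Curator code: `LockGuard` and its `State` enum are hypothetical stand-ins for a connection state listener and the states named in the quoted documentation. It assumes one guard is created per lock acquisition, so a LOST state stays final for that guard.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Tracks whether a held lock can still be trusted, based on connection state changes. */
final class LockGuard {
    /** Hypothetical stand-in for the connection states named in the documentation. */
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final AtomicReference<State> state = new AtomicReference<>(State.CONNECTED);
    private final Thread lockHolder;

    LockGuard(Thread lockHolder) {
        this.lockHolder = lockHolder;
    }

    /** Call this from the connection state callback. */
    void onStateChanged(State newState) {
        // LOST is sticky: once the session is gone, this acquisition is over.
        state.updateAndGet(prev -> prev == State.LOST ? State.LOST : newState);
        if (newState == State.LOST) {
            lockHolder.interrupt(); // ask the protected work to stop
        }
    }

    /** True only while the connection state still supports holding the lock. */
    boolean lockIsTrusted() {
        State s = state.get();
        return s == State.CONNECTED || s == State.RECONNECTED;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> { }); // placeholder for the thread holding the lock
        LockGuard guard = new LockGuard(worker);
        guard.onStateChanged(State.SUSPENDED);
        System.out.println(guard.lockIsTrusted()); // false
        guard.onStateChanged(State.RECONNECTED);
        System.out.println(guard.lockIsTrusted()); // true
    }
}
```

The protected work would check `lockIsTrusted()` before each side effect and treat an interrupt as a signal to release and re-acquire.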

> On Jun 21, 2018, at 9:32 AM, Jakub Ječmínek  wrote:
> 
> Hello,
> I am using curator's InterProcessMutexes and I have found in the 
> documentation this note on error handling:
> It is strongly recommended that you add a ConnectionStateListener and watch 
> for SUSPENDED and LOST state changes. If a SUSPENDED state is reported you 
> cannot be certain that you still hold the lock unless you subsequently 
> receive a RECONNECTED state. If a LOST state is reported it is certain that 
> you no longer hold the lock. 
> 
> Is there some best practice how to handle this? I have not found any official 
> example of handling the interprocessmutex, when connection to zookeeper is 
> lost. 
> 
> I guess I would have to have a separate boolean (probably 
> AtomicReference<Boolean>) that would indicate whether the connection to 
> ZooKeeper is present, because when the connection is lost the 
> isOwnedByCurrentThread and isAcquiredInThisProcess methods still return true.
> This boolean reference would be set in the stateChanged method. Is this 
> assumption correct?
> 
> I would be very grateful for any hints and help.
> with best regards
> Jakub Jecminek
> 



Re: Unexpected session expired event

2018-03-28 Thread Jordan Zimmerman
We'd need to see the ZooKeeper logs to know why you're losing your session. 
Also, you may have more luck on the ZooKeeper mailing list.

-Jordan

> On Mar 29, 2018, at 6:27 AM, Cameron McKenzie  wrote:
> 
> hey Joe,
> Curator just wraps up the Zookeeper client. It is responsible for sending 
> heart beats etc, which it does in its own thread. If you're running hundreds 
> of threads within your VM, perhaps it's possible that the client ZK thread is 
> getting starved, it's difficult to know.
> 
> cheers
> 
> On Thu, Mar 29, 2018 at 8:16 AM, Joe Naegele wrote:
> Cameron,
> 
> Thanks, this is very helpful.
> 
> My program is not experiencing extreme GC events, and it doesn’t appear the 
> ZK server is either (as expected). I have a feeling the connection loss is 
> due to the client being starved of CPU time. My main program spawns hundreds 
> of threads with different roles, then waits on them in succession, so the 
> main thread effectively sleeps (BLOCKED) for most of its lifetime. I’m 
> curious though, how does the Curator framework send/receive heartbeats once 
> started? I had assumed it has its own thread pool and runs “in the 
> background”, regardless of what my main thread and any other threads do.
> 
> I follow your suggestion to read/write additional state for tracking 
> persistent lock ownership. Sounds reasonable.
> 
> Thanks again
> 
> ---
> Joe Naegele
> Grier Forensics
> 
> 
> From: Cameron McKenzie [mailto:mckenzie@gmail.com 
> ] 
> Sent: Wednesday, March 28, 2018 4:37 PM
> To: user@curator.apache.org 
> Subject: Re: Unexpected session expired event
> 
>  
> 
> hey Joe,
> 
> The session timeout that you've outlined in the Curator code is 120 seconds, 
> but the negotiated timeout is 40 seconds. This is presumably because the tick 
> time on your Zookeeper server is set to 2 seconds (the maximum session 
> timeout is 20 * configured tick time).
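
The arithmetic behind the 40-second figure can be sketched as a simple clamp. This assumes the ZooKeeper server uses its default bounds (minimum 2 × tickTime, maximum 20 × tickTime) and that neither bound has been overridden in the server configuration; `SessionTimeouts` and `negotiated` are hypothetical names.

```java
/** Models how a ZooKeeper server clamps the session timeout a client asks for. */
final class SessionTimeouts {
    /**
     * Assumes the server defaults: minimum 2 * tickTime, maximum 20 * tickTime.
     * All values are in milliseconds.
     */
    static int negotiated(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // 120 s requested with a 2 s tick is capped at 40 s.
        System.out.println(negotiated(120_000, 2_000)); // 40000
    }
}
```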
> 
>  
> 
> As to why you're seeing the connection loss, it's very difficult to know. It 
> could be that ZooKeeper or your clients are being starved of CPU time, 
> meaning that heartbeats aren't being sent / received. Or, it could be that 
> you're getting 'stop the world' GC events in the JVM.
> 
> In regards to the lock acquisition, if you're trying to ensure that the 
> second client doesn't run until the first client completes its work (rather 
> than just the first client no longer holding a lock) then you probably need 
> to write some state to ZK to communicate this. The state used by the locks is 
> transient, so it will disappear if the lock holder's session is closed.
> 
> cheers
> 
> 
> On Thu, Mar 29, 2018 at 7:14 AM, Joe Naegele wrote:
> 
> Good afternoon,
> 
> I’m experiencing an odd issue with ZooKeeper/Curator. I’m using ZooKeeper 
> 3.5.2-alpha and Curator 4.0.0. My issue may be a ZooKeeper problem, but I 
> want to understand whether Curator is involved, or how Curator might avoid 
> the problem.
> 
> I’ve configured my client (framework) as follows:
> 
> client = CuratorFrameworkFactory.builder()
>   .connectString("127.0.0.1:2181")
>   .sessionTimeoutMs(120000)
>   .connectionTimeoutMs(3)
>   .retryPolicy(new ExponentialBackoffRetry(2000, 8))
>   .build()
> 
> client.start()
> 
>  
> 
> My ZooKeeper server is just one server with a “default” configuration.  Two 
> client programs run on the same server and use a variety of recipes, 
> including GroupMember, SharedCount, DistributedAtomicInteger, and, most 
> importantly, an InterProcessReadWriteLock. One of the client programs 
> performs hours of work while the other waits on the lock. While doing work, 
> the program often reports the following message:
> 
>  
> 
> 06:23:56.867 WARN  org.apache.zookeeper.ClientCnxn - Client session timed 
> out, have not heard from server in 32147ms for sessionid 0x1092fb2fc3000a9
> 
>  
> 
> and ZooKeeper immediately afterward reports:
> 
>
> 
> 06:23:56,910 [myid:] - WARN  [NIOServerCnxn@365] - Unable to read additional 
> data from client sessionid 0x1092fb2fc3000a9, likely client has closed socket
> 
> 06:23:56,911 [myid:] - INFO  [MBeanRegistry@128] - Unregister MBean 
> [org.apache.ZooKeeperService:name0=StandaloneServer_port2181,name1=Connections,name2=127.0.0.1,name3=0x1092fb2fc3000a9]
> 
> 06:23:56,912 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket connection 
> for client /127.0.0.1:47312  which had sessionid 
> 0x1092fb2fc3000a9
> 
> 06:23:58,447 [myid:] - INFO  
> [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296
>  

Re: Connection Factories for Curator / Zookeeper / HTTP Tunneling

2018-03-03 Thread Jordan Zimmerman
I don't know much about Cloud Foundry but how does it handle things like 
replicated databases, etc.? There are copious systems that don't do HTTP.

> On Mar 3, 2018, at 10:13 AM, Chris Miles  wrote:
> 
> Thanks Jordan,
> 
> It is a cloud-foundry cloud, the network restrictions are between nodes of 
> deployed applications. So server instances cant communicate unless I can 
> tunnel them through HTTP.
> 
> HTTP is the only protocol of communication I am able to use. This is not 
> currently changeable.
> 
> Chris
> 
>> Curator wraps the built-in ZooKeeper client, so Curator doesn't give you any 
>> benefit that isn't already present in ZooKeeper itself. You can easily use 
>> port 80 or 443 as the ZK/Curator client port. But, the ZooKeeper protocol 
>> (jute) is of course not HTTP. If the Firewall is expecting HTTP it won't 
>> work. Ports 2888/3888 are only used internally between ZooKeeper server 
>> instances. Those should all be behind a firewall so should be OK.
>> 
>> -JZ
>> 
>>> On Mar 3, 2018, at 9:25 AM, Chris Miles  wrote:
>>> 
>>> Firstly, I apologise for the cross post, but I think this is a question 
>>> which may need to be seen by both users, and devs who understand the 
>>> underlying code.
>>> 
>>> I need to deploy Zookeeper to a firewall restricted cloud-foundry cloud, 
>>> where the only communication can happen between nodes is through HTTP, so I 
>>> am looking at ways of getting ZooKeeper communicating through HTTP 
>>> tunnelling.
>>> 
>>> As far as I can determine, ZooKeeper only allows the configuring of the 
>>> main client connection via server and client connection factories, but not 
>>> for the 2888 and 3888 connectivity, which is I think ((correct me if 
>>> wrong)) node to node communication on the first one, and leader election on 
>>> the second?
>>> 
>>> Does Curator's connection handling give me any ability to intercept and 
>>> wrap the connections used for the rest of these ports? (Netty Http Tunnel).
>> 
> 



Re: Connection Factories for Curator / Zookeeper / HTTP Tunneling

2018-03-03 Thread Jordan Zimmerman
Curator wraps the built-in ZooKeeper client, so Curator doesn't give you any 
benefit that isn't already present in ZooKeeper itself. You can easily use port 
80 or 443 as the ZK/Curator client port. But, the ZooKeeper protocol (jute) is 
of course not HTTP. If the Firewall is expecting HTTP it won't work. Ports 
2888/3888 are only used internally between ZooKeeper server instances. Those 
should all be behind a firewall so should be OK.

-JZ

> On Mar 3, 2018, at 9:25 AM, Chris Miles  wrote:
> 
> Firstly, I apologise for the cross post, but I think this is a question which 
> may need to be seen by both users, and devs who understand the underlying 
> code.
> 
> I need to deploy Zookeeper to a firewall restricted cloud-foundry cloud, 
> where the only communication can happen between nodes is through HTTP, so I 
> am looking at ways of getting ZooKeeper communicating through HTTP tunnelling.
> 
> As far as I can determine, ZooKeeper only allows the configuring of the main 
> client connection via server and client connection factories, but not for the 
> 2888 and 3888 connectivity, which is I think ((correct me if wrong)) node to 
> node communication on the first one, and leader election on the second?
> 
> Does Curator's connection handling give me any ability to intercept and wrap 
> the connections used for the rest of these ports? (Netty Http Tunnel).



[ANNOUNCE] Apache Curator 4.0.1 released

2018-02-11 Thread Jordan Zimmerman
Hello,

The Apache Curator team is pleased to announce the release of version
4.0.1. Apache Curator is a Java/JVM
client library for Apache ZooKeeper[1], a distributed coordination service.
Apache Curator includes
a high-level API framework and utilities to make using Apache ZooKeeper
much easier and more reliable.
It also includes recipes for common use cases and extensions such as
service discovery and a Java 8
asynchronous DSL. For more details, please visit the project website:
http://curator.apache.org/

Link to release notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12314425=12341261

The most recent source release can be obtained from an Apache Mirror:
http://www.apache.org/dyn/closer.cgi/curator/
(mirror sync times may vary)

The binary artifacts for Curator are available from Maven Central and its
mirrors.

For general information on Apache Curator, please visit the project website:
http://curator.apache.org

Regards,
The Curator Team

[1] Apache ZooKeeper https://zookeeper.apache.org/


Re: [RELEASE] Release this week?

2018-02-07 Thread Jordan Zimmerman
I've gotten no feedback on this, so I'm going to start a release sometime today. 
Please speak now if you want any last-minute PRs merged before the release.

-Jordan

> On Feb 4, 2018, at 9:38 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> 
> wrote:
> 
> Hey Folks,
> 
> I'd like to put together a release for this coming week. If there are any 
> issues committers/interested parties are working on that should be in the 
> release, please complete them now.
> 
> -Jordan



[RELEASE] Release this week?

2018-02-04 Thread Jordan Zimmerman
Hey Folks,

I'd like to put together a release for this coming week. If there are any 
issues committers/interested parties are working on that should be in the 
release, please complete them now.

-Jordan

Re: Hello - (How to use LeaderLatch in PathChildrenCacheListener)

2017-12-21 Thread Jordan Zimmerman
Moving to @users...

There's not enough of a code snippet to answer properly. However, I'd use a 
LeaderSelector instead and allocate/use the PathChildrenCache inside of it. 

-Jordan

> On Dec 21, 2017, at 5:12 AM, 習慣/zt①个 <381954...@qq.com> wrote:
> 
> Hello:
> I'd like to launch an app with 2 replicas. Both instances use 
> 'PathChildrenCache' to watch a ZK path. 
> I expect only one instance (the leader) to handle the 
> 'PathChildrenCacheListener' event. Do I need to wait for leader election 
> to complete before handling the 'PathChildrenCacheListener' event?
> I worry that a node change could trigger the 'PathChildrenCacheListener' 
> event before leader election has completed, especially on an unstable 
> network.
> 
> [demo]:
> PathChildrenCache cache = new PathChildrenCache(client, nodePath, false);
> cache.getListenable().addListener(new PathChildrenCacheListener() {
>     @Override
>     public void childEvent(CuratorFramework client, PathChildrenCacheEvent event)
>             throws Exception {
>         if (event.getType() == Type.CHILD_ADDED
>                 || event.getType() == Type.CHILD_REMOVED) {
> 
>             // wait until leader election has completed
>             while (!leaderLatch.getLeader().isLeader()) {
>                 try {
>                     TimeUnit.MILLISECONDS.sleep(1000);
>                 } catch (InterruptedException e) {
>                     Thread.currentThread().interrupt();
>                 }
>             }
> 
>             String node = ZKPaths.getNodeFromPath(event.getData().getPath());
>             if (leaderLatch.hasLeadership()) {
>                 switch (event.getType()) {
>                     case CHILD_ADDED:
>                         log.info("Added node : {}", node);
>                         // TODO: 2017/12/21
>                         break;
>                     case CHILD_REMOVED:
>                         log.info("Removed node : {}", node);
>                         // TODO: 2017/12/21
>                         break;
>                 }
>             } else {
>                 log.info("This LeaderLatch is not leader");
>             }
>         }
>     }
> });
> cache.start();
> 



Re: Missing event in PathChildrenCache

2017-11-11 Thread Jordan Zimmerman
ZooKeeper does not guarantee that you will get every event. There are myriad 
ways that a client can miss events. If you are writing an application that 
depends on getting every event then ZooKeeper is not a good solution for you.

In your example, are the create of "/test1" and the delete of "/test1" done by 
the same client? Is the PathChildrenCache started when the create of "/test1" 
is done? I can envision this scenario:

* At T1 PathChildrenCache is started
* At T2 your test code creates "/test1"
* At T3 PathChildrenCache's watcher gets called
* At T4 your test code deletes "/test1"
* At T5 PathChildrenCache tries to read "/test1" and it's no longer there so it 
ignores it - no events are reported
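
To make the timeline above concrete, here is a minimal in-memory sketch of it. It is a toy model, not Curator's implementation: the class MissedEventModel and its methods are hypothetical names, and the only assumption it encodes is that the cache learns "something changed" and must then re-read the current state, so a node that is created and deleted between two reads leaves no trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this is NOT Curator code, just the T1..T5 timeline replayed in memory.
class MissedEventModel {

    /**
     * Replays the timeline and returns the events a state-diffing cache would report.
     * deleteBeforeRead = true models the delete (T4) landing before the read (T5).
     */
    static List<String> replay(boolean deleteBeforeRead) {
        Set<String> server = new TreeSet<>(); // nodes that currently exist
        Set<String> cached = new TreeSet<>(); // nodes the cache has seen so far
        List<String> events = new ArrayList<>();

        // T1: the cache starts with an empty view.
        // T2/T3: "/test1" is created; the watcher is only told that something changed.
        server.add("/test1");

        if (deleteBeforeRead) {
            // T4: the node is deleted before the cache gets to read it.
            server.remove("/test1");
            // T5: the cache re-reads and finds nothing to report.
            refresh(server, cached, events);
        } else {
            // The cache reads in time, then sees the later delete as a second change.
            refresh(server, cached, events);
            server.remove("/test1");
            refresh(server, cached, events);
        }
        return events;
    }

    // Compares the server state with the cached view and records the differences.
    private static void refresh(Set<String> server, Set<String> cached, List<String> events) {
        for (String node : server) {
            if (cached.add(node)) {
                events.add("CHILD_ADDED " + node);
            }
        }
        cached.removeIf(node -> {
            boolean gone = !server.contains(node);
            if (gone) {
                events.add("CHILD_REMOVED " + node);
            }
            return gone;
        });
    }

    public static void main(String[] args) {
        System.out.println("delete before read: " + replay(true));
        System.out.println("delete after read:  " + replay(false));
    }
}
```

In this model the first call reports no events at all, while the second reports the add followed by the remove. A test that needs CHILD_REMOVED would therefore have to wait until the cache has reported CHILD_ADDED before issuing the delete.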

-Jordan

> On Nov 10, 2017, at 11:57 PM, Jihoon Son  wrote:
> 
> Hi,
> 
> I'm using Curator 4.0.0 and Curator-testing 2.12.0 for the Zookeeper 
> compatibility.
> 
> Recently, one of our unit tests keeps failing, and I found a weird case 
> while looking into it. The unit test essentially does the following.
> 
> - Makes a PathChildrenCache for '/' and writes something on a '/test1' in 
> background. The PathChildrenCache has an event listener for the 
> 'CHILD_REMOVED' event.
> - Deletes '/test1' transactionally.
> - Expects PathChildrenCache does something on 'CHILD_REMOVED' event.
> 
> This test works well locally, but mostly fails on Travis. The problem is 
> that if the delete in the second step is performed before the actual write 
> on '/test1' is done in the first step, PathChildrenCache never receives the 
> 'CHILD_REMOVED' event. 
> 
> I'm not sure whether this is a bug. Is there any way to guarantee that 
> PathChildrenCache receives the 'CHILD_REMOVED' event properly?
> 
> Thanks,
> Jihoon



[ANNOUNCE] Apache Curator 4.0.0 released

2017-07-29 Thread Jordan Zimmerman
Hello,

The Apache Curator team is pleased to announce the release of version
4.0.0. Apache Curator is a Java/JVM
client library for Apache ZooKeeper[1], a distributed coordination service.
Apache Curator includes
a high-level API framework and utilities to make using Apache ZooKeeper
much easier and more reliable.
It also includes recipes for common use cases and extensions such as
service discovery and a Java 8
asynchronous DSL. For more details, please visit the project website:
http://curator.apache.org/

Release Summary:

Apache Curator 4.0.0 contains important bug fixes and new features.
Highlights:

- Support for TTL Nodes
- A new strongly typed, modeled DSL
- Data migration
- Unified ZooKeeper 3.4.x and 3.5.x support

Link to release notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12314425&version=12339847

The most recent source release can be obtained from an Apache Mirror:

http://www.apache.org/dyn/closer.cgi/curator/

(mirror sync times may vary)

The binary artifacts for Curator are available from Maven Central and its
mirrors.

For general information on Apache Curator, please visit the project website:
http://curator.apache.org

Regards,
The Curator Team

[1] Apache ZooKeeper https://zookeeper.apache.org/


DISCUSS: Remove Classic connection handling

2017-07-23 Thread Jordan Zimmerman
Background: We are preparing to release what will be Apache Curator 4.0.0. As 
usual, our tests are a bit flaky. Currently, the biggest culprit is "classic" 
connection handling. This is a special internal mode that most people probably 
don't even know about. It was added as a backward compatibility stopgap when 
the meaning of connection state LOST changed from Curator 2.x to Curator 3.x. 
With Curator 4.0, I'd like to completely remove "classic" connection handling 
unless there are objections. Here is the PR:

https://github.com/apache/curator/pull/233 


Please comment on the PR if you have any issues with this. I'll leave this open 
for 48 hours - i.e. barring objections this will get merged Wednesday morning 
US time.

-Jordan

Re: Proposal: end-of-life for Curator 2.0 and make next version 4.0.0

2017-07-22 Thread Jordan Zimmerman
I've proven that it works by running all the framework/recipe tests with the 
latest master but forcing ZooKeeper 3.4.8 as the client library and server (via 
curator-test 2.12.0).

> On Jul 22, 2017, at 2:55 AM, Cameron McKenzie  wrote:
> 
> Sounds ok to me as long as the backwards compatibility is all good



Re: Proposal: end-of-life for Curator 2.0 and make next version 4.0.0

2017-07-22 Thread Jordan Zimmerman
Thanks Cameron. I'll do the release, no worries.

-JZ

> On Jul 22, 2017, at 2:55 AM, Cameron McKenzie <cammcken...@apache.org> wrote:
> 
> Sounds ok to me as long as the backwards compatibility is all good. I'm still 
> away for another couple of days, but I can do the release sometime next week 
> if you like.
> cheers 
> 
> On Thu, Jul 20, 2017 at 5:48 AM, Jordan Zimmerman <jor...@jordanzimmerman.com 
> <mailto:jor...@jordanzimmerman.com>> wrote:
> Folks,
> 
> I was trying to backport some PRs to CURATOR-2.0 and it was very difficult. 
> So, I decided to see what it would take to make Curator 3.x work with 
> ZooKeeper 3.4.x. The result is 
> https://issues.apache.org/jira/browse/CURATOR-425 
> <https://issues.apache.org/jira/browse/CURATOR-425>. There will be a PR later 
> today (barring anything unforeseen). Given this, there's no reason to 
> continue with CURATOR-2.0. Also, given the number of changes and new features 
> in master, I propose that we make the next release Curator 4.0.0. 
> 
> Any objections/comments?
> 
> -Jordan
> 



Re: Curator support for SSL connection

2017-06-28 Thread Jordan Zimmerman
Curator uses the ZooKeeper.java client internally. So, whatever that code 
supports, Curator supports. I don't know too much about SSL for ZooKeeper. 
Maybe ask on the ZooKeeper list unless someone else here knows more.

-Jordan

> On Jun 28, 2017, at 1:36 AM, Srikanth Hugar  wrote:
> 
> Hello,
> 
>   I am a newbie to Curator and am using version 2.10.0. I wanted to know 
> whether Curator supports SSL connections with an SSL tunnel.
> 
> I tried to find information, but could not find anything relevant.
> 
> I would appreciate some pointers or docs on SSL connection support, if 
> it is supported.
> 
> Thank you.
> 
> Best Regards,
> Srikanth.



Re: distributed locking issue

2017-04-07 Thread Jordan Zimmerman
A few things:

* What's the purpose of "numLocks"? It's always 1 (as it should be)

* On line 59 it should be: long end = System.nanoTime();

* System.nanoTime() "provides nanosecond precision, but not necessarily 
nanosecond resolution". Your lock is immediately released and, thus, you cannot 
really rely on the clocks being right. See: 
http://www.principiaprogramatica.com/?p=16 


* I'm not sure this actually tests anything. The safest way to judge whether 
two threads hold the lock at the same time is to introduce an AtomicBoolean in 
your code. For example, on line 57 add something like:

if ( !debugIsLocked.compareAndSet(false, true) ) {
    throw new IllegalStateException("another thread holds the lock");
}

Then, set it to false in your finally block. 
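
A self-contained sketch of that check follows. It is an illustration under assumptions, not the code from the gist: the class LockOverlapCheck and the method countOverlaps are hypothetical names, and a java.util.concurrent ReentrantLock stands in for the Curator lock so the example runs without a ZooKeeper server. It counts overlaps instead of throwing, and only clears the flag if this thread was the one that set it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical harness: detects two threads being inside the locked region at once.
class LockOverlapCheck {

    /** Runs the workers against the given lock and returns how many overlaps were seen. */
    static int countOverlaps(Lock lock, int threads, int iterations) {
        AtomicBoolean debugIsLocked = new AtomicBoolean(false); // true while a holder is inside
        AtomicInteger overlaps = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < iterations; i++) {
                    lock.lock();
                    // Flip the flag only if no other thread currently has it set.
                    boolean marked = debugIsLocked.compareAndSet(false, true);
                    try {
                        if (!marked) {
                            overlaps.incrementAndGet(); // another thread holds the lock
                        }
                    } finally {
                        if (marked) {
                            debugIsLocked.set(false); // clear only the flag this thread set
                        }
                        lock.unlock();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return overlaps.get();
    }

    public static void main(String[] args) {
        System.out.println("overlaps: " + countOverlaps(new ReentrantLock(), 4, 10_000));
    }
}
```

Unlike comparing System.nanoTime() values taken on different threads, this check does not depend on clock resolution: the flag can only be observed as already set if two threads are inside the locked region at the same moment.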

-JZ

> On Apr 7, 2017, at 9:09 AM, Amit Dalal  wrote:
> 
> Hi Jordan,
> 
> We have been using Apache Curator for distributed locking in a production 
> environment for a couple of years. Recently, we found a case in which the 
> same lock was acquired by two threads at the same time. We tried reproducing 
> the case via a main method and were able to do it.
> 
> Here's the testcase: 
> https://gist.github.com/amdalal/bf993fa9d2e2770663959b0f940bdb8f 
> 
> 
> You might find the code familiar, as you answered a question about it on 
> Stack Overflow 
> (http://stackoverflow.com/questions/29852353/apachecurator-distributed-locking-performance).
> 
> Attached are the logs from the testcase. We noticed that around 0.4% of the 
> time, the same lock was acquired by multiple threads, as there are multiple 
> instances like the ones below in the logs.
> 
> T2:101:ACQUIRED:28518535141438:in 2 ms
> T1:101:ACQUIRED:28518535406490:in 1 ms
> T2:101:RELEASED:28518535421748:in 0 ms
> T1:101:RELEASED:28518535649433:in 0 ms
> 
> T1:101:ACQUIRED:28518703211491:in 1 ms
> T2:101:ACQUIRED:28518703461041:in 0 ms
> T1:101:RELEASED:28518703468657:in 0 ms
> T2:101:RELEASED:28518703690163:in 0 ms
> 
> Any idea what is going wrong? Would appreciate your time.
> 
> Thanks,
> Amit
> 


