[jira] [Created] (ZOOKEEPER-3724) [Java Client] - Calculation of connectionTimeout needs improvement.

2020-02-12 Thread Deepak Vilakkat (Jira)
Deepak Vilakkat created ZOOKEEPER-3724:
--

 Summary: [Java Client] - Calculation of connectionTimeout needs 
improvement.
 Key: ZOOKEEPER-3724
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3724
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Reporter: Deepak Vilakkat


[https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L439]

The client derives the per-host connection timeout by dividing the session timeout by 
the number of servers in the connect string. This makes scaling ZooKeeper an issue 
unless every client is told to increase its sessionTimeout to a large value. We already 
had a production outage when a client in an Asia data center was trying to write to a 
ZooKeeper server in America for cross-colo announcements. The session timeout was kept 
at 5000ms and had been working all along, but the cluster size was increased, which 
pushed the per-host connection timeout below 200ms. Since it is technically impossible 
to connect within that time, we increased the session timeout.

 

Shouldn't there be a floor value, for example 5 seconds, below which this per-host 
timeout should not drop? In theory this calculation can make even connections over a 
local network time out in some use cases.
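
For illustration, a minimal sketch of the current per-host connect timeout derivation 
(the division referenced at the ClientCnxn line linked above) and the proposed floor; 
the class and the MIN_CONNECT_TIMEOUT_MS constant are hypothetical, introduced only for 
this example:
{code:java}
// Sketch only: shows how a per-host connect timeout derived by dividing the
// session timeout across the ensemble shrinks as the ensemble grows, and how
// a floor would prevent that. MIN_CONNECT_TIMEOUT_MS is a hypothetical constant.
public final class ConnectTimeoutSketch {

    // Proposed floor (assumption): never wait less than 5 seconds per host.
    private static final int MIN_CONNECT_TIMEOUT_MS = 5000;

    static int currentConnectTimeout(int sessionTimeoutMs, int serverCount) {
        // Current behaviour: the session timeout is split evenly across all servers.
        return sessionTimeoutMs / serverCount;
    }

    static int proposedConnectTimeout(int sessionTimeoutMs, int serverCount) {
        // Proposed behaviour: same division, but never below the floor.
        return Math.max(MIN_CONNECT_TIMEOUT_MS, sessionTimeoutMs / serverCount);
    }

    public static void main(String[] args) {
        // The scenario from this report: 5000 ms session timeout, ensemble grown past 25 servers.
        System.out.println(currentConnectTimeout(5000, 26));   // 192 ms per host
        System.out.println(proposedConnectTimeout(5000, 26));  // 5000 ms per host
    }
}
{code}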

 

This was also discussed in 
[http://zookeeper-user.578899.n2.nabble.com/How-to-modify-Client-Connection-timer-td7583017.html#a7583019]
 and I am trying to understand whether there is some other catch to this 
implementation.





Jenkins build became unstable: zookeeper-master-maven-jdk13 #67

2020-02-12 Thread Apache Jenkins Server




Jenkins build is back to stable : zookeeper-master-maven-jdk12 #373

2020-02-12 Thread Apache Jenkins Server




Re: [VOTE] Apache ZooKeeper release 3.5.7 candidate 2

2020-02-12 Thread Patrick Hunt
+1 xsum/sig verify, rat ran clean, compiled from source and ran some manual
testing.

Patrick

On Mon, Feb 10, 2020 at 7:23 AM Andor Molnar  wrote:

> +1 (binding)
>
> - release notes are OK,
> - documentation looks good,
> - verified signatures, checksum,
> - Java & C unit tests passed,
> - verified 3-node cluster with zk-latencies.py (create, get, delete,
> setAcl, getAcl, watchers)
>
> Andor
>
>
>
> > On 2020. Feb 10., at 12:52, Norbert Kalmar  wrote:
> >
> > This is the third bugfix release candidate for 3.5.7. It fixes 25 issues,
> > including third-party CVE fixes, potential data loss and a potential split
> > brain if some rare conditions exist.
> >
> > There are 4 additional patches compared to rc0 and rc1:
> > - ZOOKEEPER-3453: missing 'SET' in zkCli on windows
> > - ZOOKEEPER-3716: upgrade netty 4.1.42 to address CVE-2019-20444 CVE-20…
> > - ZOOKEEPER-3718: The tarball generated by assembly is missing some files
> > - ZOOKEEPER-3719: Fix C Client compilation issues
> >
> > The full release notes are available at:
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310801&version=12346098
> >
> > *** Please download, test and vote by February 13th 2020, 23:59 UTC+0.
> ***
> >
> > Source files:
> > https://people.apache.org/~nkalmar/zookeeper-3.5.7-candidate-2/
> >
> > Maven staging repo:
> >
> https://repository.apache.org/content/groups/staging/org/apache/zookeeper/zookeeper/3.5.7/
> >
> > The release candidate tag in git to be voted upon: release-3.5.7-rc2
> >
> > ZooKeeper's KEYS file containing PGP keys we use to sign the release:
> > https://www.apache.org/dist/zookeeper/KEYS
> >
> > Should we release this candidate?
>
>


Build failed in Jenkins: zookeeper-branch36-java11 #48

2020-02-12 Thread Apache Jenkins Server


Changes:


--
[...truncated 58.29 KB...]

Jenkins build became unstable: zookeeper-master-maven-jdk12 #372

2020-02-12 Thread Apache Jenkins Server




[jira] [Created] (ZOOKEEPER-3723) Zookeeper Client should not fail with ZSYSTEMERROR if DNS does not resolve one of the servers in the zk ensemble.

2020-02-12 Thread Suhas Dantkale (Jira)
Suhas Dantkale created ZOOKEEPER-3723:
-

 Summary: Zookeeper Client should not fail with ZSYSTEMERROR if DNS 
does not resolve one of the servers in the zk ensemble. 
 Key: ZOOKEEPER-3723
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3723
 Project: ZooKeeper
  Issue Type: Improvement
  Components: c client, java client
Reporter: Suhas Dantkale


This is a minor enhancement request: do not fail the session initiation if DNS is 
unable to resolve the hostname of one of the servers in the ZooKeeper ensemble.

 

The Zookeeper client resolves all the hostnames in the ensemble while 
establishing the session.

In a Kubernetes environment with CoreDNS, the hostname entry gets removed from CoreDNS 
during pod restarts. Although we can tune the CoreDNS settings to delay the removal of 
the hostname entry from DNS, we don't want to leave any race where the ZooKeeper client 
is trying to establish a session and fails because DNS is temporarily unable to resolve 
the hostname. So as long as at least one of the servers in the ensemble is 
DNS-resolvable, shouldn't we avoid failing the session establishment with a hard error 
and instead try to establish the connection with one of the other servers?

 

Look at the snippet below, where resolve_hosts() fails with ZSYSTEMERROR.
{code:c}
if ((rc = getaddrinfo(host, port_spec, &hints, &res0)) != 0) {
            //bug in getaddrinfo implementation when it returns
            //EAI_BADFLAGS or EAI_ADDRFAMILY with AF_UNSPEC and
            // ai_flags as AI_ADDRCONFIG
#ifdef AI_ADDRCONFIG
            if ((hints.ai_flags == AI_ADDRCONFIG) &&
// ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
#ifdef EAI_ADDRFAMILY
                ((rc ==EAI_BADFLAGS) || (rc == EAI_ADDRFAMILY))) {
#else
                (rc == EAI_BADFLAGS)) {
#endif
                //reset ai_flags to null
                hints.ai_flags = 0;
                //retry getaddrinfo
                rc = getaddrinfo(host, port_spec, &hints, &res0);
            }
#endif
            if (rc != 0) {
                errno = getaddrinfo_errno(rc);
#ifdef _WIN32
                LOG_ERROR(LOGCALLBACK(zh), "Win32 message: %s\n", 
gai_strerror(rc));
#elif __linux__ && __GNUC__
                LOG_ERROR(LOGCALLBACK(zh), "getaddrinfo: %s\n", 
gai_strerror(rc));
#else
                LOG_ERROR(LOGCALLBACK(zh), "getaddrinfo: %s\n", 
strerror(errno));
#endif
                rc=ZSYSTEMERROR;
                goto fail;
            }
        }
{code}
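
On the Java client side, a minimal sketch of the tolerant behaviour proposed above 
might look like the following; the resolveAvailable() helper, its use of 
InetAddress.getAllByName, and the class name are illustrative assumptions, not existing 
ZooKeeper code:
{code:java}
// Illustrative sketch only: resolve each configured server, skip hosts that
// currently fail DNS resolution, and raise an error only when no host at all
// can be resolved. Names and structure are assumptions for this example.
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

public final class TolerantResolverSketch {

    static List<InetSocketAddress> resolveAvailable(List<String> hosts, int port)
            throws UnknownHostException {
        List<InetSocketAddress> resolved = new ArrayList<>();
        for (String host : hosts) {
            try {
                for (InetAddress addr : InetAddress.getAllByName(host)) {
                    resolved.add(new InetSocketAddress(addr, port));
                }
            } catch (UnknownHostException e) {
                // Proposed behaviour: tolerate a temporarily unresolvable host
                // (e.g. a pod that is being restarted) instead of failing the session.
                System.err.println("Skipping unresolvable host " + host + ": " + e);
            }
        }
        if (resolved.isEmpty()) {
            // Only fail hard when the whole ensemble is unresolvable.
            throw new UnknownHostException("no server in the ensemble could be resolved");
        }
        return resolved;
    }
}
{code}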





Jenkins build is back to stable : zookeeper-master-maven-jdk13 #65

2020-02-12 Thread Apache Jenkins Server




Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-12 Thread Michael K. Edwards
Well, I think it's fair to say that they're effectively untested prior to
rc2.  But it's a reasonable posture to take that the features get baked
first and the field upgrade procedure gets tested late in the release
cycle.  Not what I would have expected personally, though, as a former
developer of field upgradable consumer electronics.

What we used to call the "first article" (the firmware delivered to the
manufacturing line) was routinely unusable for much of anything beyond the
first-time setup procedure and download of over-the-air updates, and a
firmware upgrade/downgrade cycle was the first step in smoke testing every
subsequent release candidate.  Seems like the concerns (the high cost of
upgrade failure and likelihood of permanently losing customer trust) are
similar.  But when one doesn't face a hard August ship date for the first
article for a Christmas-shopping-season product release, I suppose one can
afford a different order of operations.

I am very grateful for the ZooKeeper software and for the care and
resources that its maintainers and community put into its integrity and
vitality.  I value the release engineering process, and don't take for
granted that any given snapshot off of a release branch is fit for
purpose.  At the same time, I'd feel more confident recommending it for
more use cases within the engineering organizations I support if there were
stronger test scaffolding around version migration and similar production
operations.  That's something I'd like to help with in the future,
resources permitting.

Cheers,
- Michael

On Wed, Feb 12, 2020 at 4:16 AM Andor Molnar  wrote:

> Hi Michael,
>
> "if we can get to rc2 without noticing a showstopper…”
>
> 200% disagree with this.
>
> The whole point of release voting system is to identify problems no matter
> how big they are. The message of finding a showstopper for me is that
> people paying attention and accurately testing the release. This is a very
> good thing and emphasises how much effort the ZooKeeper community is
> putting into every single release. Otherwise we could just set up a Jenkins
> job which creates and publishes a new release in every six months and say
> good luck with them.
>
> I admit that currently we don’t have (rolling) upgrade tests, but I feel
> demand from the community to fill this gap.
>
> “rolling upgrades (and mixed ensembles generally) are effectively untested”
>
> Not true. That’s exactly what we are currently doing (manually for now).
>
> "there have to be a hundred corner cases beyond the MultiAddress issue”
>
> Sure thing. True for every new feature in every release. That’s why I’m
> happy disabling it by default. People usually don’t pick up releases ending
> with .0, production upgrades are expected from .1 or .2 or maybe later
> depending on how much risk would like to be taken.
>
> Andor
>
>
>
> > On 2020. Feb 11., at 23:35, Michael K. Edwards 
> wrote:
> >
> > I think it would be prudent to emphasize in the release notes that
> rolling
> > upgrades (and mixed ensembles generally) are effectively untested.  That
> > this was, in practice, a non-goal of this release cycle.  Because if we
> can
> > get to rc2 without noticing a showstopper, clearly it's not something
> that
> > anyone has gotten around to attempting; and there have to be a hundred
> > corner cases beyond the MultiAddress issue.
> >
> > On Tue, Feb 11, 2020 at 12:27 PM Szalay-Bekő Máté <
> > szalay.beko.m...@gmail.com> wrote:
> >
> >> I see the main problem here in the fact that we are missing proper
> >> versioning in the leader election / quorum protocols. I tried to simply
> >> implement backward compatibility in 3.6, but it didn't solve the
> problem.
> >> The new code understands the old protocol, but it can not decide when to
> >> use the new or the old protocol during connection initiation. So the old
> >> servers can not read the new init messages and we still temporarily end
> up
> >> having two partitions during rolling restart.
> >>
> >> I already suggested two ways to handle this later, but I think for 3.6.0
> >> now the simplest solution is to disable the new MultiAddress feature and
> >> stick to the old protocol version by default. Plus extend the
> >> documentation with the note, that enabling the MultiAddress feature is
> not
> >> possible during a rolling upgrade, but it needs to be done with a
> separate
> >> rolling restart. With this approach, the rolling restart should "just
> work"
> >> with the 3.4 / 3.5 configs and we don't require any extra step /
> >> configuration from the users, unless they want to use the new feature. I
> >> plan to submit a PR with these changes tomorrow to ZOOKEEPER-3720, if
> there
> >> isn't any different opinion.
> >>
> >> P.S. For 4.0 we might need to put some extra thinking into backward
> >> compatibility / versioning for the quorum and client protocols.
> >>
> >>
> >> On Tue, Feb 11, 2020, 20:44 Michael K. Edwards 
> >> wrote:
> >>
> >>> I hate to say it, but I think 3.6.0

Build failed in Jenkins: zookeeper-branch36-java8 #47

2020-02-12 Thread Apache Jenkins Server

Changes:


--
[...truncated 51.32 KB...]

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-12 Thread Szalay-Bekő Máté
FYI: PR just submitted, see https://github.com/apache/zookeeper/pull/1251
Any comments are welcome! :)

Kind regards,
Mate

On Wed, Feb 12, 2020 at 1:16 PM Andor Molnar  wrote:

> Hi Michael,
>
> "if we can get to rc2 without noticing a showstopper…”
>
> 200% disagree with this.
>
> The whole point of release voting system is to identify problems no matter
> how big they are. The message of finding a showstopper for me is that
> people paying attention and accurately testing the release. This is a very
> good thing and emphasises how much effort the ZooKeeper community is
> putting into every single release. Otherwise we could just set up a Jenkins
> job which creates and publishes a new release in every six months and say
> good luck with them.
>
> I admit that currently we don’t have (rolling) upgrade tests, but I feel
> demand from the community to fill this gap.
>
> “rolling upgrades (and mixed ensembles generally) are effectively untested”
>
> Not true. That’s exactly what we are currently doing (manually for now).
>
> "there have to be a hundred corner cases beyond the MultiAddress issue”
>
> Sure thing. True for every new feature in every release. That’s why I’m
> happy disabling it by default. People usually don’t pick up releases ending
> with .0, production upgrades are expected from .1 or .2 or maybe later
> depending on how much risk would like to be taken.
>
> Andor
>
>
>
> > On 2020. Feb 11., at 23:35, Michael K. Edwards 
> wrote:
> >
> > I think it would be prudent to emphasize in the release notes that
> rolling
> > upgrades (and mixed ensembles generally) are effectively untested.  That
> > this was, in practice, a non-goal of this release cycle.  Because if we
> can
> > get to rc2 without noticing a showstopper, clearly it's not something
> that
> > anyone has gotten around to attempting; and there have to be a hundred
> > corner cases beyond the MultiAddress issue.
> >
> > On Tue, Feb 11, 2020 at 12:27 PM Szalay-Bekő Máté <
> > szalay.beko.m...@gmail.com> wrote:
> >
> >> I see the main problem here in the fact that we are missing proper
> >> versioning in the leader election / quorum protocols. I tried to simply
> >> implement backward compatibility in 3.6, but it didn't solve the
> problem.
> >> The new code understands the old protocol, but it can not decide when to
> >> use the new or the old protocol during connection initiation. So the old
> >> servers can not read the new init messages and we still temporarily end
> up
> >> having two partitions during rolling restart.
> >>
> >> I already suggested two ways to handle this later, but I think for 3.6.0
> >> now the simplest solution is to disable the new MultiAddress feature and
> >> stick to the old protocol version by default. Plus extend the
> >> documentation with the note, that enabling the MultiAddress feature is
> not
> >> possible during a rolling upgrade, but it needs to be done with a
> separate
> >> rolling restart. With this approach, the rolling restart should "just
> work"
> >> with the 3.4 / 3.5 configs and we don't require any extra step /
> >> configuration from the users, unless they want to use the new feature. I
> >> plan to submit a PR with these changes tomorrow to ZOOKEEPER-3720, if
> there
> >> isn't any different opinion.
> >>
> >> P.S. For 4.0 we might need to put some extra thinking into backward
> >> compatibility / versioning for the quorum and client protocols.
> >>
> >>
> >> On Tue, Feb 11, 2020, 20:44 Michael K. Edwards 
> >> wrote:
> >>
> >>> I hate to say it, but I think 3.6.0 should release as is.  It is
> >>> impossible
> >>> to *reliably* retrofit backwards compatibility / interoperability onto
> a
> >>> release that was engineered from the beginning without that goal.
> Learn
> >>> the lesson, set goals differently in the future.
> >>>
> >>> On Tue, Feb 11, 2020 at 9:41 AM Szalay-Bekő Máté <
> >>> szalay.beko.m...@gmail.com>
> >>> wrote:
> >>>
>  FYI: I created these scripts for my local tests:
>  https://github.com/symat/zk-rolling-upgrade-test
> 
>  For the long term I would also add some script that actually monitors
> >>> the
>  state of the quorum and also runs continuous traffic, not just 1-2
>  smoketests after each restart. But I don't know how important this
> would
>  be.
> 
>  On Tue, Feb 11, 2020 at 5:25 PM Enrico Olivelli 
>  wrote:
> 
> > Il giorno mar 11 feb 2020 alle ore 17:17 Andor Molnar
> >  ha scritto:
> >>
> >> The most obvious one which crosses my mind is that I previously
> >>> worked
> > on:
> >>
> >> 1) run old version cluster,
> >> 2) connect to each node and run smoke tests,
> >> 3) restart one node with new code,
> >> 4) goto 2) until all nodes are upgraded
> >>
> >> I think this wouldn’t work in a “unit test”, we probably need a
>  separate
> > Jenkins job and a nice python script to do this.
> >>
> >> Andor
> >>
> >>
> >>
> >>
> >>> On 2020. Fe

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-12 Thread Andor Molnar
Hi Michael,

"if we can get to rc2 without noticing a showstopper…”

200% disagree with this. 

The whole point of the release voting system is to identify problems, no matter how big 
they are. Finding a showstopper tells me that people are paying attention and 
accurately testing the release. This is a very good thing and emphasises how much 
effort the ZooKeeper community is putting into every single release. Otherwise we could 
just set up a Jenkins job which creates and publishes a new release every six months 
and say good luck with them.

I admit that currently we don’t have (rolling) upgrade tests, but I feel there is 
demand from the community to fill this gap.

“rolling upgrades (and mixed ensembles generally) are effectively untested”

Not true. That’s exactly what we are currently doing (manually for now).

"there have to be a hundred corner cases beyond the MultiAddress issue”

Sure thing. True for every new feature in every release. That’s why I’m happy to 
disable it by default. People usually don’t pick up releases ending with .0; production 
upgrades are expected from .1 or .2 or maybe later, depending on how much risk one is 
willing to take.

Andor



> On 2020. Feb 11., at 23:35, Michael K. Edwards  wrote:
> 
> I think it would be prudent to emphasize in the release notes that rolling
> upgrades (and mixed ensembles generally) are effectively untested.  That
> this was, in practice, a non-goal of this release cycle.  Because if we can
> get to rc2 without noticing a showstopper, clearly it's not something that
> anyone has gotten around to attempting; and there have to be a hundred
> corner cases beyond the MultiAddress issue.
> 
> On Tue, Feb 11, 2020 at 12:27 PM Szalay-Bekő Máté <
> szalay.beko.m...@gmail.com> wrote:
> 
>> I see the main problem here in the fact that we are missing proper
>> versioning in the leader election / quorum protocols. I tried to simply
>> implement backward compatibility in 3.6, but it didn't solve the problem.
>> The new code understands the old protocol, but it can not decide when to
>> use the new or the old protocol during connection initiation. So the old
>> servers can not read the new init messages and we still temporarily end up
>> having two partitions during rolling restart.
>> 
>> I already suggested two ways to handle this later, but I think for 3.6.0
>> now the simplest solution is to disable the new MultiAddress feature and
>> stick to the old protocol version by default. Plus extend the
>> documentation with the note, that enabling the MultiAddress feature is not
>> possible during a rolling upgrade, but it needs to be done with a separate
>> rolling restart. With this approach, the rolling restart should "just work"
>> with the 3.4 / 3.5 configs and we don't require any extra step /
>> configuration from the users, unless they want to use the new feature. I
>> plan to submit a PR with these changes tomorrow to ZOOKEEPER-3720, if there
>> isn't any different opinion.
>> 
>> P.S. For 4.0 we might need to put some extra thinking into backward
>> compatibility / versioning for the quorum and client protocols.
>> 
>> 
>> On Tue, Feb 11, 2020, 20:44 Michael K. Edwards 
>> wrote:
>> 
>>> I hate to say it, but I think 3.6.0 should release as is.  It is
>>> impossible
>>> to *reliably* retrofit backwards compatibility / interoperability onto a
>>> release that was engineered from the beginning without that goal.  Learn
>>> the lesson, set goals differently in the future.
>>> 
>>> On Tue, Feb 11, 2020 at 9:41 AM Szalay-Bekő Máté <
>>> szalay.beko.m...@gmail.com>
>>> wrote:
>>> 
 FYI: I created these scripts for my local tests:
 https://github.com/symat/zk-rolling-upgrade-test
 
 For the long term I would also add some script that actually monitors
>>> the
 state of the quorum and also runs continuous traffic, not just 1-2
 smoketests after each restart. But I don't know how important this would
 be.
 
 On Tue, Feb 11, 2020 at 5:25 PM Enrico Olivelli 
 wrote:
 
> Il giorno mar 11 feb 2020 alle ore 17:17 Andor Molnar
>  ha scritto:
>> 
>> The most obvious one which crosses my mind is that I previously
>>> worked
> on:
>> 
>> 1) run old version cluster,
>> 2) connect to each node and run smoke tests,
>> 3) restart one node with new code,
>> 4) goto 2) until all nodes are upgraded
>> 
>> I think this wouldn’t work in a “unit test”, we probably need a
 separate
> Jenkins job and a nice python script to do this.
>> 
>> Andor
>> 
>> 
>> 
>> 
>>> On 2020. Feb 11., at 16:38, Patrick Hunt 
>>> wrote:
>>> 
>>> Anyone have ideas how we could add testing for upgrade? Obviously
> something
>>> we're missing, esp given it's import.
> 
> I will send an email next days with a proposal.
> btw my idea is very like Andor's one
> 
> Once we have an automatic environment we can launch from Jenkins
> 
> Enri

Jenkins build is back to stable : zookeeper-master-maven-jdk11 #371

2020-02-12 Thread Apache Jenkins Server




[jira] [Created] (ZOOKEEPER-3722) make logs of ResponseCache more readable

2020-02-12 Thread maoling (Jira)
maoling created ZOOKEEPER-3722:
--

 Summary: make logs of ResponseCache more readable
 Key: ZOOKEEPER-3722
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3722
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: maoling


The logs look redundant:
{code:java}
2020-02-12 16:16:09,208 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:2183)(secure=disabled):ResponseCache@45]
 - Response cache size is initialized with value 400.
2020-02-12 16:16:09,208 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:2183)(secure=disabled):ResponseCache@45]
 - Response cache size is initialized with value 400.{code}
What we want is:
{code:java}
2020-02-12 16:16:09,208 [myid:3] - INFO 
[QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:2183)(secure=disabled):ResponseCache@45]
 - getData Response cache size is initialized with value 400. 
2020-02-12 16:16:09,208 [myid:3] - INFO 
[QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:2183)(secure=disabled):ResponseCache@45]
 - getChild Response cache size is initialized with value 400.
{code}
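
A minimal sketch of one way to get the desired output, assuming a cacheName parameter 
is added to the ResponseCache constructor; the names and structure below are 
illustrative, not the actual patch:
{code:java}
// Illustrative sketch: tag each ResponseCache instance with a name so the
// initialization log line says which cache (getData vs. getChild) it refers to.
// Class, field and parameter names are assumptions for this example.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ResponseCacheSketch {
    private static final Logger LOG = LoggerFactory.getLogger(ResponseCacheSketch.class);

    private final String cacheName;
    private final int cacheSize;

    public ResponseCacheSketch(String cacheName, int cacheSize) {
        this.cacheName = cacheName;
        this.cacheSize = cacheSize;
        LOG.info("{} response cache size is initialized with value {}.", cacheName, cacheSize);
    }
}

// Callers would then construct the two caches along the lines of:
//   new ResponseCacheSketch("getData", cacheSize);
//   new ResponseCacheSketch("getChild", cacheSize);
{code}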


