Re: Review Request: ZOOKEEPER-999 Create an package integration project

2011-07-28 Thread Patrick Hunt

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1143/
---

(Updated 2011-07-28 07:13:52.220703)


Review request for zookeeper and Mahadev Konar.


Changes
---

update 10 from eric


Summary
---

The goal of this ticket is to generate a set of RPM/Debian packages which 
integrate well with the RPM set created by HADOOP-6255.


This addresses bug ZOOKEEPER-999.
https://issues.apache.org/jira/browse/ZOOKEEPER-999


Diffs (updated)
-

  ./README_packaging.txt PRE-CREATION 
  ./bin/zkCleanup.sh 1151144 
  ./bin/zkCli.sh 1151144 
  ./bin/zkEnv.sh 1151144 
  ./bin/zkServer.sh 1151144 
  ./build.xml 1151144 
  ./ivy.xml 1151144 
  ./src/contrib/build-contrib.xml 1151144 
  ./src/contrib/build.xml 1151144 
  ./src/contrib/zkpython/build.xml 1151144 
  ./src/contrib/zkpython/ivy.xml PRE-CREATION 
  ./src/contrib/zkpython/src/packages/deb/zkpython.control/control PRE-CREATION 
  ./src/contrib/zkpython/src/packages/rpm/spec/zkpython.spec PRE-CREATION 
  ./src/contrib/zkpython/src/python/setup.py 1151144 
  ./src/packages/deb/init.d/zookeeper PRE-CREATION 
  ./src/packages/deb/zookeeper.control/conffile PRE-CREATION 
  ./src/packages/deb/zookeeper.control/control PRE-CREATION 
  ./src/packages/deb/zookeeper.control/postinst PRE-CREATION 
  ./src/packages/deb/zookeeper.control/postrm PRE-CREATION 
  ./src/packages/deb/zookeeper.control/preinst PRE-CREATION 
  ./src/packages/deb/zookeeper.control/prerm PRE-CREATION 
  ./src/packages/rpm/init.d/zookeeper PRE-CREATION 
  ./src/packages/rpm/spec/zookeeper.spec PRE-CREATION 
  ./src/packages/templates/conf/zookeeper-env.sh PRE-CREATION 
  ./src/packages/update-zookeeper-env.sh PRE-CREATION 
  ./src/recipes/build-recipes.xml 1151144 
  ./src/recipes/build.xml 1151144 
  ./src/recipes/lock/build.xml 1151144 
  ./src/recipes/queue/build.xml 1151144 

Diff: https://reviews.apache.org/r/1143/diff


Testing
---


Thanks,

Patrick



[jira] [Commented] (ZOOKEEPER-999) Create an package integration project

2011-07-28 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072221#comment-13072221
 ] 

jirapos...@reviews.apache.org commented on ZOOKEEPER-999:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1143/
---

(Updated 2011-07-28 07:13:52.220703)


Review request for zookeeper and Mahadev Konar.


Changes
---

update 10 from eric


Summary
---

The goal of this ticket is to generate a set of RPM/Debian packages which 
integrate well with the RPM set created by HADOOP-6255.


This addresses bug ZOOKEEPER-999.
https://issues.apache.org/jira/browse/ZOOKEEPER-999


Diffs (updated)
-

  ./README_packaging.txt PRE-CREATION 
  ./bin/zkCleanup.sh 1151144 
  ./bin/zkCli.sh 1151144 
  ./bin/zkEnv.sh 1151144 
  ./bin/zkServer.sh 1151144 
  ./build.xml 1151144 
  ./ivy.xml 1151144 
  ./src/contrib/build-contrib.xml 1151144 
  ./src/contrib/build.xml 1151144 
  ./src/contrib/zkpython/build.xml 1151144 
  ./src/contrib/zkpython/ivy.xml PRE-CREATION 
  ./src/contrib/zkpython/src/packages/deb/zkpython.control/control PRE-CREATION 
  ./src/contrib/zkpython/src/packages/rpm/spec/zkpython.spec PRE-CREATION 
  ./src/contrib/zkpython/src/python/setup.py 1151144 
  ./src/packages/deb/init.d/zookeeper PRE-CREATION 
  ./src/packages/deb/zookeeper.control/conffile PRE-CREATION 
  ./src/packages/deb/zookeeper.control/control PRE-CREATION 
  ./src/packages/deb/zookeeper.control/postinst PRE-CREATION 
  ./src/packages/deb/zookeeper.control/postrm PRE-CREATION 
  ./src/packages/deb/zookeeper.control/preinst PRE-CREATION 
  ./src/packages/deb/zookeeper.control/prerm PRE-CREATION 
  ./src/packages/rpm/init.d/zookeeper PRE-CREATION 
  ./src/packages/rpm/spec/zookeeper.spec PRE-CREATION 
  ./src/packages/templates/conf/zookeeper-env.sh PRE-CREATION 
  ./src/packages/update-zookeeper-env.sh PRE-CREATION 
  ./src/recipes/build-recipes.xml 1151144 
  ./src/recipes/build.xml 1151144 
  ./src/recipes/lock/build.xml 1151144 
  ./src/recipes/queue/build.xml 1151144 

Diff: https://reviews.apache.org/r/1143/diff


Testing
---


Thanks,

Patrick



> Create an package integration project
> -
>
> Key: ZOOKEEPER-999
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-999
> Project: ZooKeeper
>  Issue Type: New Feature
>  Components: build
> Environment: Java 6, RHEL/Ubuntu
>Reporter: Eric Yang
>Assignee: Eric Yang
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-999-1.patch, ZOOKEEPER-999-10.patch, 
> ZOOKEEPER-999-2.patch, ZOOKEEPER-999-3.patch, ZOOKEEPER-999-4.patch, 
> ZOOKEEPER-999-5.patch, ZOOKEEPER-999-6.patch, ZOOKEEPER-999-7.patch, 
> ZOOKEEPER-999-8.patch, ZOOKEEPER-999-9.patch, ZOOKEEPER-999.patch
>
>
> The goal of this ticket is to generate a set of RPM/Debian packages which 
> integrate well with the RPM set created by HADOOP-6255.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




RE: FW: Does abrupt kill corrupts the datadir?

2011-07-28 Thread Laxman
Thanks for the responses Mahadev, Pat and Ben.
I understand your explanation.

My only question is: "Will there be any probability of data loss in the
scenario mentioned?"

>>> In the worst case, if the latest snapshots on all zookeeper nodes get
>>> corrupted, there is a chance of data loss.

>> if we use sigterm in the script, we would want to put a timeout in to
>> escalate to a -9

As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we
may still have data loss. But the probability is much lower if we first
give the process a chance to shut down gracefully.

Please do correct me if my understanding is wrong.
--
Laxman
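
The SIGTERM-then-escalate stop discussed in this thread (the HBase-style
approach) could be sketched roughly as follows. This is an illustration only,
not the actual zkServer.sh code; the function name, the timeout default, and
the calling convention are all made up for the example:

```shell
# Hypothetical sketch of a graceful stop with escalation: send SIGTERM,
# wait up to a timeout for the process to exit, then fall back to SIGKILL.
stop_with_escalation() {
    pid=$1
    timeout=${2:-10}   # seconds to wait before escalating (assumed default)
    kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # exited gracefully
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null               # hung: escalate to -9
}
```

A real script would also remove the PID file afterwards, as the current stop)
branch of zkServer.sh does.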

-Original Message-
From: Benjamin Reed [mailto:br...@apache.org] 
Sent: Thursday, July 28, 2011 11:40 AM
To: dev@zookeeper.apache.org
Subject: Re: FW: Does abrupt kill corrupts the datadir?

i agree with pat. if we use sigterm in the script, we would want to
put a timeout in to escalate to a -9 which makes the script a bit more
complicated without reason since we don't have any exit hooks that we
want to run. zookeeper is designed to recover well from hard failures,
much worse than a kill -9. i don't think we want to change that.

ben

On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt  wrote:
> ZK has been built around the "fail fast" approach. In order to
> maintain high availability we want to ensure that restarting a server
> will result in it attempting to rejoin the quorum. IMO we would not
> want to change this (kill -9).
>
> Patrick
>
> On Tue, Jul 26, 2011 at 2:02 AM, Laxman  wrote:
>> Hi Everyone,
>>
>> Any thoughts?
>> Do we need to consider changing the abrupt shutdown to a graceful one?
>>
>> Implementations in some other Hadoop ecosystem projects, for your
>> reference:
>> Hadoop - kill [SIGTERM]
>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if the process hangs
>> ZooKeeper - "kill -9" [SIGKILL]
>>
>>
>> -Original Message-
>> From: Laxman [mailto:lakshman...@huawei.com]
>> Sent: Wednesday, July 13, 2011 12:36 PM
>> To: 'dev@zookeeper.apache.org'
>> Subject: RE: Does abrupt kill corrupts the datadir?
>>
>> Hi Mahadev,
>>
>> Shutdown hook is just a quick thought. Another approach can be to just
>> send a kill [SIGTERM] call, which can be interpreted by the process.
>>
>> A first look at "kill -9" triggered the following scenario:
>>> In the worst case, if the latest snapshots on all zookeeper nodes get
>>> corrupted, there is a chance of data loss.
>>
>> How can zookeeper deal with this scenario gracefully?
>>
>> Also, I feel we should give the application a chance to shut down
>> gracefully before the abrupt shutdown.
>>
>> http://en.wikipedia.org/wiki/SIGKILL
>>
>> Because SIGKILL gives the process no opportunity to do cleanup operations
>> on terminating, in most system shutdown procedures an attempt is first
>> made to terminate processes using SIGTERM, before resorting to SIGKILL.
>>
>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>
>> The application can determine what it wants to do once a SIGTERM is
>> received. While most applications will clean up their resources and stop,
>> some may not. An application may be configured to do something completely
>> different when a SIGTERM is received. Also, if the application is in a
>> bad state, such as waiting for disk I/O, it may not be able to act on
>> the signal that was sent.
>>
>> Most system administrators will usually resort to the more abrupt signal
>> when an application doesn't respond to a SIGTERM.
>>
>> -Original Message-
>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>> Sent: Wednesday, July 13, 2011 12:02 PM
>> To: dev@zookeeper.apache.org
>> Subject: Re: Does abrupt kill corrupts the datadir?
>>
>> Hi Laxman,
>>  The server takes care of all the issues with data integrity, so a kill
>> -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure
>> everything works reliably is to use kill -9 :).
>>
>> Thanks
>> mahadev
>>
>> On 7/12/11 11:16 PM, "Laxman"  wrote:
>>
>>> When we stop zookeeper through "zkServer.sh stop", we are aborting the
>>> zookeeper process using "kill -9":
>>>
>>> stop)
>>>     echo -n "Stopping zookeeper ... "
>>>     if [ ! -f "$ZOOPIDFILE" ]
>>>     then
>>>       echo "error: could not find file $ZOOPIDFILE"
>>>       exit 1
>>>     else
>>>       $KILL -9 $(cat "$ZOOPIDFILE")
>>>       rm "$ZOOPIDFILE"
>>>       echo STOPPED
>>>       exit 0
>>>     fi
>>>     ;;
>>>
>>> This may corrupt the snapshot and transaction logs. Also, it's not
>>> recommended to use "kill -9".
>>>
>>> In the worst case, if the latest snapshots on all zookeeper nodes get
>>> corrupted, there is a chance of data loss.
>>>
>>>
>>>
>>> How about introducing a shutdown hook which will ensure zookeeper is
>>> shut down gracefully when we call stop?
>>>
>>>
>>>
>>> Note: This is just an observation; it was not reproduced in a test.
>>>
>>>
>>>
>>>--
>>>
>>>Thanks,
>>>
>>>Laxman
>>>
>>
>>
>>
>



[jira] [Commented] (ZOOKEEPER-954) Findbugs/ClientCnxn: Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER

2011-07-28 Thread Laxman (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072259#comment-13072259
 ] 

Laxman commented on ZOOKEEPER-954:
--

The Findbugs description for this says:
"This method performs synchronization on an object that implements 
java.util.concurrent.locks.Lock. Such an object is locked/unlocked using 
acquire()/release() rather than using the synchronized (...) construct."

But in our code [ClientCnxn] we use a LinkedBlockingQueue [waitingEvents], 
which does not implement Lock. Also, from the code it looks intentional and 
appropriate to synchronize on waitingEvents explicitly, to ensure no event is 
dropped from processing.

Any other thoughts on this?

> Findbugs/ClientCnxn: Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER
> 
>
> Key: ZOOKEEPER-954
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-954
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Reporter: Thomas Koch
>Priority: Minor
>
> JLM   Synchronization performed on java.util.concurrent.LinkedBlockingQueue 
> in org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn$Packet)
>   
> Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER (click for details)
> In class org.apache.zookeeper.ClientCnxn$EventThread
> In method 
> org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn$Packet)
> Type java.util.concurrent.LinkedBlockingQueue
> Value loaded from field 
> org.apache.zookeeper.ClientCnxn$EventThread.waitingEvents
> At ClientCnxn.java:[line 411]
> JLM   Synchronization performed on java.util.concurrent.LinkedBlockingQueue 
> in org.apache.zookeeper.ClientCnxn$EventThread.run()
>   
> Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER (click for details)
> In class org.apache.zookeeper.ClientCnxn$EventThread
> In method org.apache.zookeeper.ClientCnxn$EventThread.run()
> Type java.util.concurrent.LinkedBlockingQueue
> Value loaded from field 
> org.apache.zookeeper.ClientCnxn$EventThread.waitingEvents
> At ClientCnxn.java:[line 436]
> The respective code:
> 409  public void queuePacket(Packet packet) {
> 410 if (wasKilled) {
> 411synchronized (waitingEvents) {
> 412   if (isRunning) waitingEvents.add(packet);
> 413   else processEvent(packet);
> 414}
> 415 } else {
> 416waitingEvents.add(packet);
> 417 }
> 418  }
> 419   
> 420   public void queueEventOfDeath() {
> 421   waitingEvents.add(eventOfDeath);
> 422   }
> 423   
> 424   @Override
> 425   public void run() {
> 426  try {
> 427 isRunning = true;
> 428 while (true) {
> 429Object event = waitingEvents.take();
> 430if (event == eventOfDeath) {
> 431   wasKilled = true;
> 432} else {
> 433   processEvent(event);
> 434}
> 435if (wasKilled)
> 436   synchronized (waitingEvents) {
> 437  if (waitingEvents.isEmpty()) {
> 438 isRunning = false;
> 439 break;
> 440  }
> 441   }
> 442 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-954) Findbugs/ClientCnxn: Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER

2011-07-28 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072403#comment-13072403
 ] 

Benjamin Reed commented on ZOOKEEPER-954:
-

the synchronization on waitingEvents is to protect isRunning.
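
Ben's point can be seen in a small standalone model of the excerpt above. This
is a hypothetical sketch, not the actual ClientCnxn source: the monitor taken
by synchronized(waitingEvents) is unrelated to the queue's internal j.u.c.
lock (LinkedBlockingQueue does not implement java.util.concurrent.locks.Lock),
and exists to make the isRunning check atomic with the decision to enqueue or
process inline, so no event is dropped during shutdown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified model of the flagged pattern; names mirror the excerpt above,
// but this is not the real ZooKeeper class.
class EventQueueSketch {
    private final LinkedBlockingQueue<Object> waitingEvents = new LinkedBlockingQueue<>();
    private final Object eventOfDeath = new Object();
    private volatile boolean wasKilled = false;
    private volatile boolean isRunning = false;
    final List<Object> processed = new ArrayList<>();

    void queuePacket(Object packet) {
        if (wasKilled) {
            synchronized (waitingEvents) { // flagged: JLM_JSR166_UTILCONCURRENT_MONITORENTER
                if (isRunning) waitingEvents.add(packet);
                else processEvent(packet); // drain loop has exited: handle inline
            }
        } else {
            waitingEvents.add(packet);
        }
    }

    void queueEventOfDeath() {
        waitingEvents.add(eventOfDeath);
    }

    void run() {
        try {
            isRunning = true;
            while (true) {
                Object event = waitingEvents.take();
                if (event == eventOfDeath) {
                    wasKilled = true;
                } else {
                    processEvent(event);
                }
                if (wasKilled) {
                    synchronized (waitingEvents) { // pairs with queuePacket's block
                        if (waitingEvents.isEmpty()) {
                            isRunning = false;
                            break;
                        }
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void processEvent(Object event) {
        processed.add(event);
    }
}
```

Replacing the monitor with the queue's own lock is impossible here (it is not
exposed), which is why the Findbugs warning reads as a false positive in this
case.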

> Findbugs/ClientCnxn: Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER
> 
>
> Key: ZOOKEEPER-954
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-954
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Reporter: Thomas Koch
>Priority: Minor
>
> JLM   Synchronization performed on java.util.concurrent.LinkedBlockingQueue 
> in org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn$Packet)
>   
> Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER (click for details)
> In class org.apache.zookeeper.ClientCnxn$EventThread
> In method 
> org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn$Packet)
> Type java.util.concurrent.LinkedBlockingQueue
> Value loaded from field 
> org.apache.zookeeper.ClientCnxn$EventThread.waitingEvents
> At ClientCnxn.java:[line 411]
> JLM   Synchronization performed on java.util.concurrent.LinkedBlockingQueue 
> in org.apache.zookeeper.ClientCnxn$EventThread.run()
>   
> Bug type JLM_JSR166_UTILCONCURRENT_MONITORENTER (click for details)
> In class org.apache.zookeeper.ClientCnxn$EventThread
> In method org.apache.zookeeper.ClientCnxn$EventThread.run()
> Type java.util.concurrent.LinkedBlockingQueue
> Value loaded from field 
> org.apache.zookeeper.ClientCnxn$EventThread.waitingEvents
> At ClientCnxn.java:[line 436]
> The respective code:
> 409  public void queuePacket(Packet packet) {
> 410 if (wasKilled) {
> 411synchronized (waitingEvents) {
> 412   if (isRunning) waitingEvents.add(packet);
> 413   else processEvent(packet);
> 414}
> 415 } else {
> 416waitingEvents.add(packet);
> 417 }
> 418  }
> 419   
> 420   public void queueEventOfDeath() {
> 421   waitingEvents.add(eventOfDeath);
> 422   }
> 423   
> 424   @Override
> 425   public void run() {
> 426  try {
> 427 isRunning = true;
> 428 while (true) {
> 429Object event = waitingEvents.take();
> 430if (event == eventOfDeath) {
> 431   wasKilled = true;
> 432} else {
> 433   processEvent(event);
> 434}
> 435if (wasKilled)
> 436   synchronized (waitingEvents) {
> 437  if (waitingEvents.isEmpty()) {
> 438 isRunning = false;
> 439 break;
> 440  }
> 441   }
> 442 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: FW: Does abrupt kill corrupts the datadir?

2011-07-28 Thread Benjamin Reed
almost everything we do in zookeeper is to make sure that we don't
lose data in much worse scenarios. the probability of a loss in this
scenario is really just the probability of a bug in the code. i don't
think that kill -TERM vs kill -KILL changes that probability at all
either way.

ben

On Thu, Jul 28, 2011 at 12:50 AM, Laxman  wrote:
> Thanks for the responses Mahadev, Pat and Ben.
> I understand your explanation.
>
> My only question is "Will there be any probability data loss in the scenario
> mentioned?"
>
>>>> In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>> there is a chance of data loss.
>
>>>if we use sigterm in the script, we would want to put a timeout in to
> escalate to a -9
>
> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, still
> we may have data loss. But the probability is very less by giving a chance
> to shutdown gracefully.
>
> Please do correct me if my understanding is wrong.
> --
> Laxman
>
> -Original Message-
> From: Benjamin Reed [mailto:br...@apache.org]
> Sent: Thursday, July 28, 2011 11:40 AM
> To: dev@zookeeper.apache.org
> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>
> i agree with pat. if we use sigterm in the script, we would want to
> put a timeout in to escalate to a -9 which makes the script a bit more
> complicated without reason since we don't have any exit hooks that we
> want to run. zookeeper is designed to recover well from hard failures,
> much worse than a kill -9. i don't think we want to change that.
>
> ben
>
> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt  wrote:
>> ZK has been built around the "fail fast" approach. In order to
>> maintain high availability we want to ensure that restarting a server
>> will result in it attempting to rejoin the quorum. IMO we would not
>> want to change this (kill -9).
>>
>> Patrick
>>
>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman  wrote:
>>> Hi Everyone,
>>>
>>> Any thoughts?
>>> Do we need consider changing abrupt shutdown to
>>>
>>> Implementations in some other hadoop eco system projects for your
>>> reference.
>>> Hadoop - kill [SIGTERM]
>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>> ZooKeeper - "kill -9" [SIGKILL]
>>>
>>>
>>> -Original Message-
>>> From: Laxman [mailto:lakshman...@huawei.com]
>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>> To: 'dev@zookeeper.apache.org'
>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>
>>> Hi Mahadev,
>>>
>>> Shutdown hook is just a quick thought. Another approach can be just give
>>> a kill [SIGTERM] call which can be interpreted by process.
>>>
>>> First look at the "kill -9" triggered the following scenario.
>>>> In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>> there is a chance of dataloss.
>>>
>>> How does zookeeper can deal with this scenario gracefully?
>>>
>>> Also, I feel we should give a chance to application to shutdown
>>> gracefully before abrupt shutdown.
>>>
>>> http://en.wikipedia.org/wiki/SIGKILL
>>>
>>> Because SIGKILL gives the process no opportunity to do cleanup operations
>>> on terminating, in most system shutdown procedures an attempt is first
>>> made to terminate processes using SIGTERM, before resorting to SIGKILL.
>>>
>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>
>>> The application can determine what it wants to do once a SIGTERM is
>>> received. While most applications will clean up their resources and stop,
>>> some may not. An application may be configured to do something completely
>>> different when a SIGTERM is received. Also, if the application is in a
>>> bad state, such as waiting for disk I/O, it may not be able to act on
>>> the signal that was sent.
>>>
>>> Most system administrators will usually resort to the more abrupt signal
>>> when an application doesn't respond to a SIGTERM.
>>>
>>> -Original Message-
>>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>>> Sent: Wednesday, July 13, 2011 12:02 PM
>>> To: dev@zookeeper.apache.org
>>> Subject: Re: Does abrupt kill corrupts the datadir?
>>>
>>> Hi Laxman,
>>>  The servers takes care of all the issues with data integrity, so a kill
>>> -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure
>>> everything works reliably is use kill -9 :).
>>>
>>> Thanks
>>> mahadev
>>>
>>> On 7/12/11 11:16 PM, "Laxman"  wrote:
>>>
>>>> When we stop zookeeper through "zkServer.sh stop", we are aborting the
>>>> zookeeper process using "kill -9":
>>>>
>>>> stop)
>>>>     echo -n "Stopping zookeeper ... "
>>>>     if [ ! -f "$ZOOPIDFILE" ]
>>>>     then
>>>>       echo "error: could not find file $ZOOPIDFILE"
>>>>       exit 1
>>>>     else
>>>>       $KILL -9 $(cat "$ZOOPIDFILE")
>>>>       rm "$ZOOPIDFILE"
>>>>       echo STOPPED
>>>>       exit 0
>>>>     fi
>>>>     ;;
>>>>
>>>> This may corrupt the snapshot and transaction logs.

Re: Out of memory running ZK unit tests against trunk

2011-07-28 Thread Patrick Hunt
I tracked this down to a low ulimit setting on the particular jenkins
host where this was failing (max processes).

Specifically, the following test was failing on trunk but not on
branch 3_3, which concerns me:
./src/java/test/org/apache/zookeeper/test/QuorumZxidSyncTest.java

there haven't been any real changes to this test between versions, any
insight into why the server is using more threads in trunk vs
branch33?

Patrick

On Fri, Jul 22, 2011 at 10:58 AM, Patrick Hunt  wrote:
> I've never seen this before, but in my CI environment (sun jdk
> 1.6.0_20) I'm seeing some intermittent failures such as the following.
>
> Has anyone added/modified tests for 3.4.0 that might be using more
> threads/memory than previously? Creating ZK clients but not closing
> them, etc...
>
> java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:597)
>       at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.start(NIOServerCnxnFactory.java:114)
>       at 
> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:406)
>       at 
> org.apache.zookeeper.test.QuorumBase.startServers(QuorumBase.java:186)
>       at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:103)
>       at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:67)
>       at 
> org.apache.zookeeper.test.QuorumZxidSyncTest.setUp(QuorumZxidSyncTest.java:39)
>
> Patrick
>


Re: FW: Does abrupt kill corrupts the datadir?

2011-07-28 Thread Andrei Savu
I've been doing some testing in the past for this scenario, and I've
seen no data loss over an extended period of time (a day).

Testing steps:
0. start an ensemble running 5 servers
1. start a workload generator (e.g. push a strictly increasing
sequence of numbers to a queue stored in zookeeper)
2. every few seconds, kill the cluster leader (-9) and restart it

You should be careful about how you handle ConnectionLossException and
OperationTimeoutException.

You can find the code for this test here (executed against the trunk version):
https://github.com/andreisavu/zookeeper-mq

-- Andrei Savu / andreisavu.ro

On Thu, Jul 28, 2011 at 9:05 AM, Benjamin Reed  wrote:
> almost everything we do in zookkeeper is to make sure that we don't
> lose data in much worse scenarios. the probably of a loss in this
> scenario is really just the probability of a bug in the code. i don't
> think that kill -TERM vs kill -KILL changes that probability at all
> either way.
>
> ben
>
> On Thu, Jul 28, 2011 at 12:50 AM, Laxman  wrote:
>> Thanks for the responses Mahadev, Pat and Ben.
>> I understand your explanation.
>>
>> My only question is "Will there be any probability data loss in the scenario
>> mentioned?"
>>
>>>>> In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>>> there is a chance of data loss.
>>
>>>> if we use sigterm in the script, we would want to put a timeout in to
>>>> escalate to a -9
>>
>> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, still
>> we may have data loss. But the probability is very less by giving a chance
>> to shutdown gracefully.
>>
>> Please do correct me if my understanding is wrong.
>> --
>> Laxman
>>
>> -Original Message-
>> From: Benjamin Reed [mailto:br...@apache.org]
>> Sent: Thursday, July 28, 2011 11:40 AM
>> To: dev@zookeeper.apache.org
>> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>>
>> i agree with pat. if we use sigterm in the script, we would want to
>> put a timeout in to escalate to a -9 which makes the script a bit more
>> complicated without reason since we don't have any exit hooks that we
>> want to run. zookeeper is designed to recover well from hard failures,
>> much worse than a kill -9. i don't think we want to change that.
>>
>> ben
>>
>> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt  wrote:
>>> ZK has been built around the "fail fast" approach. In order to
>>> maintain high availability we want to ensure that restarting a server
>>> will result in it attempting to rejoin the quorum. IMO we would not
>>> want to change this (kill -9).
>>>
>>> Patrick
>>>
>>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman  wrote:
>>>> Hi Everyone,
>>>>
>>>> Any thoughts?
>>>> Do we need consider changing abrupt shutdown to
>>>>
>>>> Implementations in some other hadoop eco system projects for your
>>>> reference.
>>>> Hadoop - kill [SIGTERM]
>>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>>> ZooKeeper - "kill -9" [SIGKILL]
>>>>
>>>>
>>>> -Original Message-
>>>> From: Laxman [mailto:lakshman...@huawei.com]
>>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>>> To: 'dev@zookeeper.apache.org'
>>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Mahadev,
>>>>
>>>> Shutdown hook is just a quick thought. Another approach can be just give
>>>> a kill [SIGTERM] call which can be interpreted by process.
>>>>
>>>> First look at the "kill -9" triggered the following scenario.
>>>>> In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>>> there is a chance of dataloss.
>>>>
>>>> How does zookeeper can deal with this scenario gracefully?
>>>>
>>>> Also, I feel we should give a chance to application to shutdown
>>>> gracefully before abrupt shutdown.
>>>>
>>>> http://en.wikipedia.org/wiki/SIGKILL
>>>>
>>>> Because SIGKILL gives the process no opportunity to do cleanup operations
>>>> on terminating, in most system shutdown procedures an attempt is first
>>>> made to terminate processes using SIGTERM, before resorting to SIGKILL.
>>>>
>>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>>
>>>> The application can determine what it wants to do once a SIGTERM is
>>>> received. While most applications will clean up their resources and stop,
>>>> some may not. An application may be configured to do something completely
>>>> different when a SIGTERM is received. Also, if the application is in a
>>>> bad state, such as waiting for disk I/O, it may not be able to act on the
>>>> signal that was sent.
>>>>
>>>> Most system administrators will usually resort to the more abrupt signal
>>>> when an application doesn't respond to a SIGTERM.
>>>>
>>>> -Original Message-
>>>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>>>> Sent: Wednesday, July 13, 2011 12:02 PM
>>>> To: dev@zookeeper.apache.org
>>>> Subject: Re: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Laxman,
>>>>  The servers takes care of all the issues with data integrity, so a kill
>>>> -9 is OK. Shutdown hooks are tricky.

[jira] [Commented] (ZOOKEEPER-1034) perl bindings should automatically find the zookeeper c-client headers

2011-07-28 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072594#comment-13072594
 ] 

Mahadev konar commented on ZOOKEEPER-1034:
--

Looks like the jenkins builds will be down for a while. Nichols can you post 
the results of ant test for this patch?

> perl bindings should automatically find the zookeeper c-client headers
> --
>
> Key: ZOOKEEPER-1034
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1034
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: contrib
>Affects Versions: 3.3.3
>Reporter: Nicholas Harteau
>Assignee: Nicholas Harteau
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-1034-trunk.patch
>
>
> Installing Net::ZooKeeper from cpan or the zookeeper distribution tarballs 
> will always fail due to not finding c-client header files.  In conjunction 
> with ZOOKEEPER-1033 update perl bindings to look for c-client header files in 
> INCDIR/zookeeper/
> a.k.a. make installs of Net::ZooKeeper via cpan/cpanm/whatever *just work*, 
> assuming you've already got the zookeeper c client installed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (ZOOKEEPER-1034) perl bindings should automatically find the zookeeper c-client headers

2011-07-28 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072594#comment-13072594
 ] 

Mahadev konar edited comment on ZOOKEEPER-1034 at 7/28/11 11:17 PM:


Looks like the jenkins builds will be down for a while. Nicholas can you post 
the results of ant test for this patch?

  was (Author: mahadev):
Looks like the jenkins builds will be down for a while. Nichols can you 
post the results of ant test for this patch?
  
> perl bindings should automatically find the zookeeper c-client headers
> --
>
> Key: ZOOKEEPER-1034
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1034
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: contrib
>Affects Versions: 3.3.3
>Reporter: Nicholas Harteau
>Assignee: Nicholas Harteau
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-1034-trunk.patch
>
>
> Installing Net::ZooKeeper from cpan or the zookeeper distribution tarballs 
> will always fail due to not finding c-client header files.  In conjunction 
> with ZOOKEEPER-1033 update perl bindings to look for c-client header files in 
> INCDIR/zookeeper/
> a.k.a. make installs of Net::ZooKeeper via cpan/cpanm/whatever *just work*, 
> assuming you've already got the zookeeper c client installed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Out of memory running ZK unit tests against trunk

2011-07-28 Thread Mahadev Konar
Nice find Pat. I can't see a reason why that should happen. Can we
just do a stack dump and compare?

thanks
mahadev

On Thu, Jul 28, 2011 at 1:54 PM, Patrick Hunt  wrote:
> I tracked this down to a low ulimit setting on the particular jenkins
> host where this was failing (max processes).
>
> Specifically the following test was failing on trunk, but not on
> branch 3_3, which concerns me
>    ./src/java/test/org/apache/zookeeper/test/QuorumZxidSyncTest.java
>
> there haven't been any real changes to this test between versions, any
> insight into why the server is using more threads in trunk vs
> branch33?
>
> Patrick
>
> On Fri, Jul 22, 2011 at 10:58 AM, Patrick Hunt  wrote:
>> I've never seen this before, but in my CI environment (sun jdk
>> 1.6.0_20) I'm seeing some intermittent failures such as the following.
>>
>> Has anyone added/modified tests for 3.4.0 that might be using more
>> threads/memory than previously? Creating ZK clients but not closing
>> them, etc...
>>
>> java.lang.OutOfMemoryError: unable to create new native thread
>>       at java.lang.Thread.start0(Native Method)
>>       at java.lang.Thread.start(Thread.java:597)
>>       at 
>> org.apache.zookeeper.server.NIOServerCnxnFactory.start(NIOServerCnxnFactory.java:114)
>>       at 
>> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:406)
>>       at 
>> org.apache.zookeeper.test.QuorumBase.startServers(QuorumBase.java:186)
>>       at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:103)
>>       at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:67)
>>       at 
>> org.apache.zookeeper.test.QuorumZxidSyncTest.setUp(QuorumZxidSyncTest.java:39)
>>
>> Patrick
>>
>


Re: Out of memory running ZK unit tests against trunk

2011-07-28 Thread Patrick Hunt
Near the end of this test (QuorumZxidSyncTest) there are tons of
threads running - 115 "ProcessThread" threads, and a similar number of
SessionTracker threads.

Also I see ~100 ReadOnlyRequestProcessor - why is this running as a
separate thread? (henry/flavio?)

Regardless, I'll enter a 3.4.0 blocker to clean this up - I suspect
that the server shutdown is not shutting down fully for some reason.

Patrick

On Thu, Jul 28, 2011 at 5:28 PM, Mahadev Konar  wrote:
> Nice find Pat. I cant see a reason on why that should happen. Can we
> just do a stack dump and compare?
>
> thanks
> mahadev
>
> On Thu, Jul 28, 2011 at 1:54 PM, Patrick Hunt  wrote:
>> I tracked this down to a low ulimit setting on the particular jenkins
>> host where this was failing (max processes).
>>
>> Specifically the following test was failing on trunk, but not on
>> branch 3_3, which concerns me
>>    ./src/java/test/org/apache/zookeeper/test/QuorumZxidSyncTest.java
>>
>> there haven't been any real changes to this test between versions, any
>> insight into why the server is using more threads in trunk vs
>> branch33?
>>
>> Patrick
>>
>> On Fri, Jul 22, 2011 at 10:58 AM, Patrick Hunt  wrote:
>>> I've never seen this before, but in my CI environment (sun jdk
>>> 1.6.0_20) I'm seeing some intermittent failures such as the following.
>>>
>>> Has anyone added/modified tests for 3.4.0 that might be using more
>>> threads/memory than previously? Creating ZK clients but not closing
>>> them, etc...
>>>
>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>       at java.lang.Thread.start0(Native Method)
>>>       at java.lang.Thread.start(Thread.java:597)
>>>       at 
>>> org.apache.zookeeper.server.NIOServerCnxnFactory.start(NIOServerCnxnFactory.java:114)
>>>       at 
>>> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:406)
>>>       at 
>>> org.apache.zookeeper.test.QuorumBase.startServers(QuorumBase.java:186)
>>>       at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:103)
>>>       at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:67)
>>>       at 
>>> org.apache.zookeeper.test.QuorumZxidSyncTest.setUp(QuorumZxidSyncTest.java:39)
>>>
>>> Patrick
>>>
>>
>
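
For the stack-dump comparison Mahadev suggests, counting thread-name prefixes
in a saved jstack dump is enough to spot a leak like the one Pat describes.
A small helper might look like the following sketch (the helper name and the
pid-discovery step are assumptions, not part of the ZooKeeper tree):

```shell
# Count threads in a saved jstack dump whose name starts with a given
# prefix. jstack prints each thread header with the quoted thread name
# first on the line, e.g.: "ProcessThread(sid:1)" prio=10 ...
count_threads() {
    dump=$1
    prefix=$2
    grep -c "^\"$prefix" "$dump"
}

# Typical use against a test JVM:
#   jstack "$pid" > dump.txt
#   count_threads dump.txt ProcessThread
#   count_threads dump.txt SessionTracker
```

Running this against dumps from trunk and branch 3.3 at the same point in the
test would show which server-side threads are not being shut down.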