Re: zookeeper connection issue while running for second time

2013-10-01 Thread Jyotirmoy Sundi
Hi,
I am able to run Apache Giraph successfully with around 500M pairs to
find connected components. It works well, but not always; the issue seems
to be a ZooKeeper session timeout. Some of the clients (around 5-10 out of
100) produce this error, and the master fails because of it. Do you have
any suggestions for this error? Any suggestions will be appreciated.

2013-10-02 01:20:43,651 WARN org.apache.giraph.bsp.BspService: process: Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent state:Disconnected type:None path:null
2013-10-02 01:20:44,035 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server had22.rsk.admobius.com/10.240.51.32:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2013-10-02 01:20:44,035 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to had22.rsk.admobius.com/10.240.51.32:2181, initiating session
2013-10-02 01:20:44,037 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x441604c97412331 has expired, closing socket connection
2013-10-02 01:20:44,037 WARN org.apache.giraph.bsp.BspService: process: Got unknown null path event WatchedEvent state:Expired type:None path:null
2013-10-02 01:20:44,038 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-10-02 01:21:20,046 INFO org.apache.giraph.worker.VertexInputSplitsCallable: readVertexInputSplit: Loaded 25 vertices at 1827.2925619484213 vertices/sec 1728790 edges at 12636.730317550928 edges/sec Memory (free/total/max) = 1745.60M / 2262.19M / 2730.69M
2013-10-02 01:21:24,788 INFO org.apache.giraph.worker.InputSplitsCallable: loadFromInputSplit: Finished loading /_hadoopBsp/job_201309260044_1132/_vertexInputSplitDir/601 (v=261131, e=1808572)
2013-10-02 01:21:24,789 ERROR org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.IllegalStateException: markInputSplitPathFinished: KeeperException on /_hadoopBsp/job_201309260044_1132/_vertexInputSplitDir/601/_vertexInputSplitFinished
    at org.apache.giraph.worker.InputSplitsHandler.markInputSplitPathFinished(InputSplitsHandler.java:168)
    at org.apache.giraph.worker.InputSplitsCallable.loadInputSplit(InputSplitsCallable.java:226)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:161)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:58)
    at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /_hadoopBsp/job_201309260044_1132/_vertexInputSplitDir/601/_vertexInputSplitFinished
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.giraph.zk.ZooKeeperExt.createExt(ZooKeeperExt.java:152)
    at org.apache.giraph.worker.InputSplitsHandler.markInputSplitPathFinished(InputSplitsHandler.java:159)
    ... 9 more


-- 
Best Regards,
Jyotirmoy Sundi
Admobius

San Francisco, CA 94158


On Thu, Sep 26, 2013 at 6:08 PM, Jyotirmoy Sundi sundi...@gmail.com wrote:

 Hi,

 I got the connected components job working for 1B nodes, but when I run the
 job again, it fails with the error below. Apart from this, the data in the
 ZooKeeper data directory is not cleared. For successful jobs, the data that
 Giraph writes to ZooKeeper is cleared.

 The following errors seem to occur because the node tries to connect to
 ZooKeeper with a session ID that has already been cleared, as seen in:

 Client session timed out, have not heard from server in 68845ms for
 sessionid 0x3415cc6ce930059, closing socket connection and attempting
 reconnect

 Any idea if increasing the session timeout would help?

 2013-09-27 00:57:11,748 WARN org.apache.giraph.bsp.BspService: process: Got 
 unknown null path event WatchedEvent state:Expired type:None path:null
 2013-09-27 00:57:11,748 INFO org.apache.zookeeper.ClientCnxn: Unable to 
 reconnect to ZooKeeper service, session 0x3415cc6ce930059 has expired, 
 closing socket connection
 2013-09-27 00:57:11,748 WARN org.apache.giraph.worker.InputSplitsHandler: 
 process: Problem with zookeeper, got event with path null, state Expired, 
 event type None
 2013-09-27 00:57:11,748 INFO org.apache.zookeeper.ClientCnxn: EventThread 
 shut down
 2013-09-27 00:57:11,925 INFO 

Re: zookeeper connection issue while running for second time

2013-10-01 Thread Avery Ching
We did have this error a few times. This can happen due to GC pauses, so 
I would check the worker for long GC issues.  Also, you can increase the 
ZooKeeper timeouts, see


  /** ZooKeeper session millisecond timeout */
  IntConfOption ZOOKEEPER_SESSION_TIMEOUT =
      new IntConfOption("giraph.zkSessionMsecTimeout", MINUTES.toMillis(1),
          "ZooKeeper session millisecond timeout");

Currently, the default is one minute, but in production we set that 
number much, much higher (even greater than a day sometimes) to avoid 
the disconnection.
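
For the GC check suggested above, a rough first look can come from the JVM's own management beans. The following is a minimal sketch, not part of Giraph: the class name GcPauseCheck and the 60000 ms budget (chosen to match the one-minute default) are assumptions for illustration. It reports accumulated GC time over an interval, which is only a coarse proxy; a single long pause is seen more reliably in the worker's GC logs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not a Giraph class.
class GcPauseCheck {
    /** Total time in ms the JVM has spent in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** True if the GC time accumulated between two samples exceeds the budget. */
    static boolean gcExceeded(long beforeMillis, long afterMillis, long budgetMillis) {
        return afterMillis - beforeMillis > budgetMillis;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // ... the work being measured would run here ...
        long after = totalGcMillis();
        System.out.println("GC time during interval: " + (after - before) + " ms");
        // 60000 ms is an assumed budget equal to the one-minute default timeout.
        System.out.println("Exceeded budget: " + gcExceeded(before, after, 60000L));
    }
}
```

If the accumulated GC time over a short interval approaches the session timeout, a long pause is a plausible cause of the expiry.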


Hope that helps,
Avery

Re: zookeeper connection issue while running for second time

2013-10-01 Thread Jyotirmoy Sundi
Thanks a lot, Avery, for your response. I increased the timeout to 10 minutes.
*changed:*
-Dgiraph.zkSessionMsecTimeout=60 and
-Dgiraph.useInputSplitLocality=false
It is now working for consecutive runs without any errors.
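
The option name and the MINUTES.toMillis(1) default quoted earlier in the thread both indicate that giraph.zkSessionMsecTimeout is expressed in milliseconds. Under that assumption, the unit conversion can be sketched as below. The class and method names are hypothetical, and the sketch only builds the flag string; it does not call into Giraph.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not a Giraph class.
class ZkTimeoutFlag {
    /** Builds the -D argument for a session timeout given in minutes. */
    static String sessionTimeoutFlag(long minutes) {
        long millis = TimeUnit.MINUTES.toMillis(minutes);
        return "-Dgiraph.zkSessionMsecTimeout=" + millis;
    }

    public static void main(String[] args) {
        // One-minute default from the quoted option definition.
        System.out.println(sessionTimeoutFlag(1));   // -Dgiraph.zkSessionMsecTimeout=60000
        // A ten-minute timeout expressed in milliseconds.
        System.out.println(sessionTimeoutFlag(10));  // -Dgiraph.zkSessionMsecTimeout=600000
        // The silence logged in the first report (68845 ms) is longer than one minute.
        long observedSilenceMs = 68845L;
        System.out.println(observedSilenceMs > TimeUnit.MINUTES.toMillis(1)); // true
    }
}
```

The last line shows why the one-minute default was not enough here: the logged gap of 68845 ms is already longer than 60000 ms.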

Thanks
Sundi

