Just bumping this thread, as I am still looking for answers.

On Sun, Mar 20, 2016 at 11:31 AM, Anirudh Perugu <anirudh.per...@stonybrook.edu> wrote:
> Hello all,
>
> This is a follow-up to *1. How many workers can I have?*
> I understand that per-worker parallelism is achieved using compute
> threads. Hence, the maximum value of giraph.numComputeThreads is limited to
> the number of cores. What is the number of workers limited to? (The limit
> cannot be 1 for my setup, as the job runs successfully with 2.)
>
> On Sat, Mar 19, 2016 at 6:56 PM, Anirudh Perugu <anirudh.per...@stonybrook.edu> wrote:
>
>> Hello all,
>>
>> I am a Giraph newbie, so kindly bear with me. I am trying to run BFS on a
>> graph that has 28048 edges and 786 nodes.
>>
>> *Here is my setup:*
>> Single-node cluster, 8 GB RAM, 4 cores, Apache YARN 2.7.2, Giraph 1.2.0.
>> Everything (YARN + Giraph) runs on this single machine.
>>
>> *1. How many workers can I have?*
>> I ask this because my Giraph job runs fine with the settings
>> *-w 1 -ca giraph.SplitMasterWorker=false*
>>
>> "Testing Results Table"
>>
>> *no. of workers | maximum no. of containers used | time taken for completion*
>> 1 | 3 (as seen on the UI) | 0 min 50 secs
>> 2 | 4 | 1 min 18 secs
>> 3 | 4 | Long-running job; I think it times out after 20 minutes.
>>
>> For 3 workers, this is the log:
>>
>> *INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x153910879aa0000 type:create cxid:0x1 zxid:0x2 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0004/_masterElectionDir Error:KeeperErrorCode = NoNode for /_hadoopBsp/giraph_yarn_application_1458425066569_0004/_masterElectionDir*
>>
>> So this fails for a reason I still haven't figured out. Can you explain it?
>>
>> Finally, this leads me to ask: how many workers can I have if I have 4 cores
>> on my machine? Are the number of cores and the number of workers related?
>>
>> *2. How do I use the setting giraph.yarn.task.heap.mb=x? I set x to
>> 2048, but my job runs indefinitely (hangs). It works great with the default
>> setting of 1024.*
>>
>> *The job takes forever when x=2048. The user logs say:*
>> 6/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home//Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/var/folders/s4/n58qlsh97t11vkmhysts8k680000gn/T/
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:os.name=Mac OS X
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:os.arch=x86_64
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:os.version=10.11.3
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:user.name=Anirudh
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:user.home=/Users/Anirudh
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Client environment:user.dir=/private/tmp/hadoop-Anirudh/nm-local-dir/usercache/Anirudh/appcache/application_1458425066569_0001/container_1458425066569_0001_01_000002
>> 16/03/19 18:05:36 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=172.24.18.199:22181 sessionTimeout=60000 watcher=org.apache.giraph.master.BspServiceMaster@4097cac
>> 16/03/19 18:05:36 INFO zookeeper.ClientCnxn: Opening socket connection to server 172.24.18.199/172.24.18.199:22181. Will not attempt to authenticate using SASL (unknown error)
>> 16/03/19 18:05:36 INFO server.NIOServerCnxnFactory: Accepted socket connection from /172.24.18.199:61815
>> 16/03/19 18:05:36 INFO zookeeper.ClientCnxn: Socket connection established to 172.24.18.199/172.24.18.199:22181, initiating session
>> 16/03/19 18:05:36 INFO server.ZooKeeperServer: Client attempting to establish new session at /172.24.18.199:61815
>> 16/03/19 18:05:36 INFO persistence.FileTxnLog: Creating new log file: log.1
>> 16/03/19 18:05:36 INFO server.ZooKeeperServer: Established session 0x15390e97a2a0000 with negotiated timeout 600000 for client /172.24.18.199:61815
>> 16/03/19 18:05:36 INFO zookeeper.ClientCnxn: Session establishment complete on server 172.24.18.199/172.24.18.199:22181, sessionid = 0x15390e97a2a0000, negotiated timeout = 600000
>> 16/03/19 18:05:36 INFO bsp.BspService: process: Asynchronous connection complete.
>> 16/03/19 18:05:36 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY starting...
>> 16/03/19 18:05:36 INFO graph.GraphTaskManager: map: No need to do anything when not a worker
>> 16/03/19 18:05:36 INFO graph.GraphTaskManager: cleanup: Starting for MASTER_ZOOKEEPER_ONLY
>> 16/03/19 18:05:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x1 zxid:0x2 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_masterElectionDir Error:KeeperErrorCode = NoNode for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_masterElectionDir
>> 16/03/19 18:05:36 INFO master.BspServiceMaster: becomeMaster: First child is '/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_masterElectionDir/172.24.18.199_00000000000' and my bid is '/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_masterElectionDir/172.24.18.199_00000000000'
>> 16/03/19 18:05:36 INFO netty.NettyServer: NettyServer: Using execution group with 8 threads for requestFrameDecoder.
>> 16/03/19 18:05:36 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
>> 16/03/19 18:05:36 INFO netty.NettyServer: start: Started server communication server: /172.24.18.199:30000 with up to 16 threads on bind attempt 0 with sendBufferSize = 32768 receiveBufferSize = 524288
>> 16/03/19 18:05:36 INFO netty.NettyClient: NettyClient: Using execution handler with 8 threads after request-encoder.
>> 16/03/19 18:05:36 INFO master.BspServiceMaster: becomeMaster: I am now the master!
>> 16/03/19 18:05:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0xe zxid:0x9 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0 Error:KeeperErrorCode = NoNode for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0
>> 16/03/19 18:05:36 INFO bsp.BspService: process: applicationAttemptChanged signaled
>> 16/03/19 18:05:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x16 zxid:0xc txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1 Error:KeeperErrorCode = NoNode for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1
>> 16/03/19 18:05:36 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir, type=NodeChildrenChanged, state=SyncConnected)
>> 16/03/19 18:05:36 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 1 needed to start superstep -1
>> 16/03/19 18:06:06 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 569971 more msecs left before giving up.
>> 16/03/19 18:06:06 INFO master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
>> 16/03/19 18:06:06 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x22 zxid:0x10 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
>> 16/03/19 18:06:06 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x23 zxid:0x11 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
>> 16/03/19 18:06:06 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 1 needed to start superstep -1
>> 16/03/19 18:06:36 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 539950 more msecs left before giving up.
>> 16/03/19 18:06:36 INFO master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
>> 16/03/19 18:06:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x26 zxid:0x12 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
>> 16/03/19 18:06:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x27 zxid:0x13 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
>> 16/03/19 18:06:36 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 1 needed to start superstep -1
>> 16/03/19 18:07:06 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 509938 more msecs left before giving up.
>> 16/03/19 18:07:06 INFO master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
>> 16/03/19 18:07:06 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x2a zxid:0x14 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
>> 16/03/19 18:07:06 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x2b zxid:0x15 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
>> 16/03/19 18:07:06 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 1 needed to start superstep -1
>> 16/03/19 18:07:36 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 479927 more msecs left before giving up.
>> 16/03/19 18:07:36 INFO master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
>> 16/03/19 18:07:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x2e zxid:0x16 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
>> 16/03/19 18:07:36 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x2f zxid:0x17 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
>> 16/03/19 18:07:36 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 1 needed to start superstep -1
>> 16/03/19 18:08:06 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 449916 more msecs left before giving up.
>> 16/03/19 18:08:06 INFO master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
>> 16/03/19 18:08:06 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x32 zxid:0x18 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
>> 16/03/19 18:08:06 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15390e97a2a0000 type:create cxid:0x33 zxid:0x19 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/giraph_yarn_application_1458425066569_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
>> 16/03/19 18:08:06 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 1 needed to start superstep -1
>> 16/03/19 18:08:36 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 419894 more msecs left before giving up.
>> ----------------------------------------End of Logs--------------------------------------------------------
>>
>> Does this mean that the master required 1 worker to start superstep -1 but
>> did not find any, or is it something else?
>> - If that is the case, I can confirm that I have one node (the only one), and
>> it is healthy.
>>
>> Finally, how do I give more memory to the Giraph job?
>>
>> *3. When I kill a long-running job, either by killing the application in the
>> GUI or by pressing Ctrl+C, why does jps still show these?*
>> anirudh:hadoop-2.7.2 Anirudh$ jps
>> *9009 GiraphYarnTask*
>> *8962 GiraphApplicationMaster*
>> 9267 Jps
>> 8725 NameNode
>> 8855 NodeManager
>> 8764 DataNode
>> 8815 ResourceManager
>>
>> Aren't they supposed to be killed as well? I ask because if I run a
>> new job at this point, it stays in the ACCEPTED state forever. Only after
>> those processes are killed does a fresh job run to completion (my observation).
>>
>> Alright then, I am hoping for a reply that addresses these issues.
>>
>> Thanks,
>> Anirudh
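
To make the question about the checkWorkers lines concrete, here is a minimal Python sketch that pulls the counters out of one such line from the quoted log. The regular expression and the parse_check_workers helper are my own illustration and are not part of Giraph; the sample line is copied from the log with the mail line-wrapping removed.

```python
import re

# Pattern for the master's periodic status line, modelled on the quoted log.
# This is an illustrative regex, not something shipped with Giraph.
CHECK_WORKERS = re.compile(
    r"checkWorkers: Only found (?P<found>\d+) responses of (?P<needed>\d+) "
    r"needed to start superstep (?P<superstep>-?\d+)\. "
    r"Reporting every (?P<interval>\d+) msecs, "
    r"(?P<remaining>\d+) more msecs left before giving up\."
)


def parse_check_workers(line):
    """Return the counters from one checkWorkers line, or None if it does not match."""
    match = CHECK_WORKERS.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Sample line taken from the 18:06:06 entry in the quoted log.
sample = (
    "16/03/19 18:06:06 INFO master.BspServiceMaster: checkWorkers: Only found "
    "0 responses of 1 needed to start superstep -1. Reporting every 30000 "
    "msecs, 569971 more msecs left before giving up."
)

status = parse_check_workers(sample)
missing = status["needed"] - status["found"]
print(
    f"superstep {status['superstep']}: {missing} worker(s) missing, "
    f"{status['remaining'] / 1000:.0f} s left before the master gives up"
)
```

For the 18:06:06 entry this reports one missing worker and roughly 570 seconds remaining, which is consistent with the reading that the master is waiting for a single worker that has not yet registered. It does not say why the worker is missing.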