Problem solved: I optimized the processing of each message, and that fixed it.
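For anyone who hits the same issue: the expensive part was the per-message work inside compute() (appending the vertex id to each message). One way to keep that cheap is to accumulate everything into a single buffer instead of rebuilding strings for every message. A rough sketch of that idea, not my exact code; the class name and the vertex/value/message types below are simplified placeholders:

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Illustrative sketch: append the vertex id to every incoming message while
// accumulating the results into one StringBuilder, so the cost stays linear
// in the total message size instead of re-concatenating Strings per message.
public class AppendVertexIdComputation
    extends BasicComputation<LongWritable, Text, Text, Text> {

  @Override
  public void compute(Vertex<LongWritable, Text, Text> vertex,
      Iterable<Text> messages) throws IOException {
    String id = vertex.getId().toString();
    StringBuilder result = new StringBuilder();
    for (Text message : messages) {
      result.append(message.toString()).append('-').append(id).append('\n');
    }
    // Keep the accumulated paths in the vertex value so the output format can write them.
    vertex.setValue(new Text(result.toString()));
    vertex.voteToHalt();
  }
}
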
Sorry for the spam, guys :D
Bye!
Jose

2016-08-28 15:23 GMT-03:00 José Luis Larroque <[email protected]>:

> Ok, I understand what is happening now.
>
> I started using more compute threads, because I believed the problem was
> scalability. I started the application again, using:
> giraph.numComputeThreads=15 (an r3.8xlarge has 32 cores)
> giraph.userPartitionCount=240 (240 / (4 workers x 15 threads) = 4 per compute thread)
>
> The application gets stuck on only one thread, and only in one partition.
> In that partition I'm doing a small piece of processing per message: I have
> to append the vertex id to the end of each message, so that the result is
> ready for the output of that vertex.
>
> The remaining problem is that this small per-message processing is taking
> too long, and the entire cluster is waiting for it. I know there are other
> technologies for post-processing results; maybe I should use one of them?
>
> Bye!
> Jose
>
> 2016-08-27 21:33 GMT-03:00 José Luis Larroque <[email protected]>:
>
>> Using giraph.maxNumberOfOpenRequests and
>> giraph.waitForRequestsConfirmation=true didn't solve the problem.
>>
>> I doubled the netty threads and assigned twice the original size to the
>> netty buffers, with no change.
>>
>> I condensed the messages, 1000 into 1, and got far fewer messages, but
>> still the same final result.
>>
>> Please, help.
>>
>> 2016-08-26 21:24 GMT-03:00 José Luis Larroque <[email protected]>:
>>
>>> Hi again guys!
>>>
>>> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
>>> converted the dump (https://dumps.wikimedia.org/eswiki/20160601/) into a
>>> file that can be read by Giraph.
>>>
>>> The BFS searches for paths, and everything is fine until it gets stuck
>>> at some point during superstep four.
>>>
>>> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each
>>> node is an r3.8xlarge EC2 instance. The command for executing the BFS is
>>> this one:
>>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>>> -vip /user/hduser/input/grafo-wikipedia.txt
>>> -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>>> -op /user/hduser/output/caminosNavegacionales -w 4 -yh 120000 -ca
>>> giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug
>>>
>>> Each container has almost 120 GB. I'm using the 1000M message limit in
>>> outOfCore because I believed that was the problem, but apparently it is not.
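>>>
>>> (For readability, the out-of-core part of that -ca string boils down to two
>>> configuration keys. Here is a small sketch of the same values set through the
>>> configuration API instead of -ca; only the keys already shown in the command
>>> above are used, nothing else is assumed:)
>>>
>>> import org.apache.giraph.conf.GiraphConfiguration;
>>>
>>> public class OutOfCoreSettingsSketch {
>>>   public static void main(String[] args) {
>>>     // Same values as in the -ca string of the yarn command above.
>>>     GiraphConfiguration conf = new GiraphConfiguration();
>>>     conf.setBoolean("giraph.useOutOfCoreMessages", true);  // spill messages to disk
>>>     conf.setInt("giraph.maxMessagesInMemory", 1000000000); // the "1000M" in-memory limit
>>>     conf.setBoolean("giraph.metrics.enable", true);
>>>     conf.setBoolean("giraph.isStaticGraph", true);
>>>     // conf would then be handed to the normal Giraph job submission.
>>>   }
>>> }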
>>>
>>> These are the master logs (it seems the master is waiting for the workers
>>> to finish, but they just don't... and it stays like this forever...):
>>>
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>>
>>> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
>>> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false*
>>> ...the same last two lines repeated thirty times...
>>> ...
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>>
>>> And in *all* the workers there is no information about what is happening
>>> (I'm testing this with *giraph.logLevel=Debug*, because with the default
>>> Giraph log level I was lost), and the workers print this over and over again:
>>>
>>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
>>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>>>
>>> Before starting superstep 4, the information on each worker was the following:
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>>> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>>>
>>> I don't know what exactly is failing:
>>> - I know that all the containers have memory available; on the datanodes I checked that each one had around 50 GB free.
>>> - I'm not sure if I'm hitting some sort of limit in the use of outOfCore. I know that writing messages too fast is dangerous with the 1.1 version of Giraph, but if I hit that limit, I suppose the container would fail, right?
>>> - Maybe the ZooKeeper client connections aren't enough? I read that the default value of 60 for *maxClientCnxns* in ZooKeeper may be too small for a context like AWS, but I'm not fully aware of the relationship between Giraph and ZooKeeper, so I haven't started changing default configuration values.
>>> - Maybe I have to tune the outOfCore configuration, using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true, like someone recommended here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%[email protected]%3E)?
>>> - Should I tune the netty configuration? I have the default configuration, but I believe that maybe 8 netty client threads and 8 server threads would be enough, since I only have a few workers, and maybe too many netty threads are producing the overhead that makes the entire application get stuck.
>>> - Using giraph.useBigDataIOForMessages=true didn't help me either. I know that each vertex is receiving 100M or more messages and that property should be helpful, but it didn't make any difference (see the combiner sketch below for one more way to cut the message count).
>>>
>>> As you may suspect, I have too many hypotheses; that's why I'm asking for help, so I can go in the right direction.
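>>>
>>> (About cutting the message count further: the "1000 into 1" condensing
>>> mentioned earlier could also be expressed as a message combiner, so that
>>> messages addressed to the same vertex are merged before delivery. A rough
>>> sketch against the MessageCombiner interface as I understand it in Giraph
>>> 1.1, assuming Text messages and a '|' separator; this is not code I'm
>>> actually running:)
>>>
>>> import org.apache.giraph.combiner.MessageCombiner;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>>
>>> // Sketch only: merge every Text message addressed to the same vertex into
>>> // one '|'-separated message, so each vertex receives one combined payload
>>> // instead of millions of small ones.
>>> public class ConcatTextMessageCombiner
>>>     implements MessageCombiner<LongWritable, Text> {
>>>
>>>   @Override
>>>   public void combine(LongWritable vertexId, Text original, Text toCombine) {
>>>     if (original.getLength() == 0) {
>>>       original.set(toCombine);
>>>     } else {
>>>       original.set(original.toString() + "|" + toCombine.toString());
>>>     }
>>>   }
>>>
>>>   @Override
>>>   public Text createInitialMessage() {
>>>     return new Text();
>>>   }
>>> }
>>>
>>> (It would be registered on the job configuration, via setMessageCombinerClass
>>> in 1.1 if I remember the setter correctly, and it only helps if compute() can
>>> split and handle the merged payload.)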
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Bye!
>>> Jose
