Re: Problem processing large graph
Hi Avery,

Thanks for your reply. I did adjust the heap and container size to higher values (3072 MB and 4096 MB respectively), and I am not running the out-of-core option either. I am intermittently able to run the job with 200 mappers. At other times, part of the data runs while the rest stalls. FYI, I am using Netty without authentication.

One thing I have noticed, however, is that the job mostly runs successfully when the queue is almost idle or comparatively free. On most occasions when the queue is running more tasks or is over-allocated, my job stalls even when the required number of containers has been allocated. Looking at the logs, I mostly find it stalling at superstep 0 or 1 after finishing superstep -1, or sometimes even at -1. Could there be some shared resource in the queue which is not enough for the job while it runs on a loaded queue, and is there some other value I can configure to make it run?

Tripti Singh
Tech Yahoo, Software Sys Dev Eng
P: +91 080.30516197 M: +91 9611907150
Yahoo Software Development India Pvt. Ltd
Torrey Pines, Bangalore 560 071

From: Avery Ching <ach...@apache.org>
Reply-To: user@giraph.apache.org
Date: Wednesday, September 3, 2014 at 10:53 PM
To: user@giraph.apache.org
Subject: Re: Problem processing large graph

Hi Tripti,

Is there a chance you can use higher-memory machines so you don't run out of core? We do it this way at Facebook. We haven't tested the out-of-core option.

Avery

On 8/31/14, 2:34 PM, Tripti Singh wrote:

Hi,

I am able to successfully build the hadoop_yarn profile for running Giraph 1.1. I am also able to test-run Connected Components on a small dataset. However, I am seeing 2 issues while running on a bigger dataset with 400 mappers:

1. I am unable to use the out-of-core graph option. It errors out saying that it cannot read an INIT partition. (Sorry, I don't have the log currently, but I will share it after I run that again.) I expect that if the out-of-core option were fixed, I should be able to run the workflow with fewer mappers.

2. In order to run the workflow anyhow, I removed the out-of-core option and adjusted the heap size. This also runs with the smaller dataset but fails with the huge dataset. Worker logs are mostly empty. Non-empty logs end like this:

mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
[STATUS: task-374] setup: Beginning worker setup.
setup: Log level remains at info
[STATUS: task-374] setup: Initializing Zookeeper services.
mapred.job.id is deprecated. Instead, use mapreduce.job.id
job.local.dir is deprecated. Instead, use mapreduce.job.local.dir
[STATUS: task-374] setup: Setting up Zookeeper manager.
createCandidateStamp: Made the directory _bsp/_defaultZkManagerDir/giraph_yarn_application_1407992474095_708614
createCandidateStamp: Made the directory _bsp/_defaultZkManagerDir/giraph_yarn_application_1407992474095_708614/_zkServer
createCandidateStamp: Creating my filestamp _bsp/_defaultZkManagerDir/giraph_yarn_application_1407992474095_708614/_task/gsta33201.tan.ygrid.yahoo.com 374
getZooKeeperServerList: For task 374, got file 'null' (polling period is 3000)

The master log has statements for launching containers, opening proxies, and processing events, like this:

Opening proxy : gsta31118.tan.ygrid.yahoo.com:8041
Processing Event EventType: QUERY_CONTAINER for Container container_1407992474095_708614_01_000314
……

I am not using SASL authentication. Any idea what might be wrong?

Thanks,
Tripti.
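The stall at superstep -1 described above is the phase where Giraph sets up its task-managed ZooKeeper (the getZooKeeperServerList line in the worker log is that polling loop), which lines up with the loaded-queue symptom. One thing to try is pointing the job at an external ZooKeeper quorum and making the task heap explicit. A minimal sketch, assuming the Giraph 1.1 configuration keys giraph.zkList and giraph.yarn.task.heap.mb; the hostnames and the helper class are placeholders, not a verified fix:

import org.apache.giraph.conf.GiraphConfiguration;

// Hedged sketch only: keys are from the Giraph 1.1 constants as best I can
// tell, and the ZooKeeper hostnames are placeholders.
public class ZkTuningSketch {
  public static GiraphConfiguration configure() {
    GiraphConfiguration conf = new GiraphConfiguration();
    // Use an external ZooKeeper quorum so superstep -1 does not depend on
    // electing a task-managed ZooKeeper server inside a contended queue.
    conf.set("giraph.zkList", "zk1.example.com:2181,zk2.example.com:2181");
    // Per-task heap in MB for the pure-YARN profile (matches the 3072 MB above).
    conf.setInt("giraph.yarn.task.heap.mb", 3072);
    return conf;
  }
}

With an external quorum, a busy queue can still delay container launches, but the superstep -1 barrier no longer waits on one of the job's own tasks winning the ZooKeeper election.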
Re: Problem processing large graph
Hi Tripti,

How many machines are you running on? The ideal configuration would be one worker per machine and one separate machine for the master. If you're using more mappers than machines, then you're using more resources than necessary, and fixing that could help.

Best,
Matthew
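Concretely, the layout Matthew describes might look like this in job setup code. A minimal sketch, assuming Giraph 1.1's GiraphConfiguration API; setWorkerConfiguration and the giraph.SplitMasterWorker key are the names as I recall them, and the latter already defaults to true:

import org.apache.giraph.conf.GiraphConfiguration;

// Sketch of one worker per machine plus a dedicated master task.
public class WorkerLayoutSketch {
  public static GiraphConfiguration oneWorkerPerMachine(int numMachines) {
    GiraphConfiguration conf = new GiraphConfiguration();
    int workers = numMachines - 1;  // reserve one machine's task for the master
    // min workers, max workers, percent of max that must respond
    conf.setWorkerConfiguration(workers, workers, 100.0f);
    // Run the master in its own task rather than sharing with a worker
    // (this is the default; made explicit here for clarity).
    conf.setBoolean("giraph.SplitMasterWorker", true);
    return conf;
  }
}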
Lockup During Edge Saving
Now that I have the loading and computation completing successfully, I am having issues when saving the edges back to disk. During the saving step, the machines get through ~1-2 partitions before the cluster freezes up entirely (as in, I can't even SSH into a machine or view the Hadoop web console).

As in my message before, I have about 1.3 billion edges total (600 million undirected, converted using the reverser) and a cluster of 19 machines, each with 8 cores and 60 GB of RAM. I am also using a custom linked-list-based OutEdges class because of the computation's high number of mutations of edge values (the byte array/big data byte array was not efficient for this use case).

The specific computation I am running has three supersteps (0, 1, 2). During supersteps 1 and 2 there is extremely high RAM usage (~97%), but the steps do complete. During saving, this high RAM usage is maintained and does not increase significantly until the cluster freezes up.

When saving the edges (I am using a custom edge output format as well, basically a CSV), are they flushed to disk immediately/in batches, or is the entire output file held in memory before being flushed? If the latter, that seems like it might cause the sort of behavior I am seeing; is there a way to change it? If this doesn't seem like the issue, does anyone have ideas about what may be causing the lockup?

Thanks in advance!

-- Andrew
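For reference on the flushing question: if a custom CSV format follows the stock TextEdgeOutputFormat pattern, then, as far as I can tell from the Giraph 1.1 code, each edge is converted to a line and handed straight to the underlying Hadoop RecordWriter, so output streams through the HDFS client's write buffers rather than being held whole in memory. A minimal sketch of that pattern; the Writable types and class names are illustrative placeholders, since the actual id/value types are not shown in the thread:

import java.io.IOException;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.io.formats.TextEdgeOutputFormat;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch of a streaming CSV edge output: one line per edge, written as each
// edge is produced rather than buffered per partition.
public class CsvEdgeOutputFormat
    extends TextEdgeOutputFormat<LongWritable, DoubleWritable, DoubleWritable> {

  @Override
  public TextEdgeWriter createEdgeWriter(TaskAttemptContext context) {
    return new CsvEdgeWriter();
  }

  private class CsvEdgeWriter extends TextEdgeWriterToEachLine {
    @Override
    protected Text convertEdgeToLine(LongWritable sourceId,
        DoubleWritable sourceValue, Edge<LongWritable, DoubleWritable> edge)
        throws IOException {
      // source,target,value -- handed to the RecordWriter per edge
      return new Text(sourceId.get() + "," + edge.getTargetVertexId().get()
          + "," + edge.getValue().get());
    }
  }
}

If the real format instead accumulates rows in a StringBuilder or a list before writing, that buffer would grow with the partition and could plausibly contribute to the memory exhaustion described above.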