Which type of input graph for the connected components example?
Hi,

After launching the connected components example from the Giraph 1.0 distribution, I noticed some (or rather, many) isolated nodes, that is, nodes whose node_ID is the same as the Min_ID used as the label of the connected component. In fact, those isolated nodes are leaves of my directed graph and are connected to a subgraph. My adjacency list also includes these leaves (fields are separated by tabs):

    node_id<tab>neighbor1<tab>...<tab>neighborN
    leaf1<tab>nothing
    node_id<tab>neighbor1<tab>...<tab>neighborN
    leaf1<tab>nothing
    leaf1<tab>nothing

What can I do to include these leaves in their connected components? The base file used to build the adjacency list looks like this:

    node1 node2
    node12 node2
    node13 node2
    node3 node4

It is directed. I could add the inverse of each relation to obtain an undirected graph, or is there another way?

Can you help me? Thanks in advance.
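One way to do what the question proposes (mirroring every directed edge so that leaves list their parent as a neighbor) is sketched below. It is only an illustration: the class name `UndirectedAdjacency`, the whitespace-separated edge format, and the tab-separated output are assumptions taken from the sample lines in the post, not a Giraph API or a required input format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Giraph: turns "src dst" edge lines into
// tab-separated adjacency lines in which every edge appears in both directions.
class UndirectedAdjacency {

  static List<String> build(List<String> edgeLines) {
    // Sorted maps and sets keep the output deterministic.
    Map<String, TreeSet<String>> adjacency = new TreeMap<>();
    for (String line : edgeLines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length != 2) {
        continue; // skip blank or malformed lines
      }
      // Record the edge and its inverse, so a leaf also points at its parent.
      adjacency.computeIfAbsent(parts[0], k -> new TreeSet<>()).add(parts[1]);
      adjacency.computeIfAbsent(parts[1], k -> new TreeSet<>()).add(parts[0]);
    }
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, TreeSet<String>> entry : adjacency.entrySet()) {
      out.add(entry.getKey() + "\t" + String.join("\t", entry.getValue()));
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> edges = Arrays.asList(
        "node1 node2", "node12 node2", "node13 node2", "node3 node4");
    build(edges).forEach(System.out::println);
  }
}
```

With the four sample edges, every node ends up with at least one neighbor, so a minimum-label propagation should be able to reach the former leaves as well.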
Giraph 1.0.0 - Netty port allocation
My teammates and I are running Giraph on a cluster where a firewall is configured on each compute node. We had 100 ports opened on the compute nodes, which we thought would be more than enough to accommodate a large number of workers. However, we're unable to go beyond about 90 workers with our Giraph jobs, because Netty ports are being allocated outside of the open range (30000-30100).

We're not sure why this is happening. We shouldn't be running more than one worker per compute node, so we assumed that only port 30000 would be used, but we're routinely seeing Giraph try to use ports greater than 30100 when we request close to 100 workers. This leads us to believe that a simple one-up numbering scheme is being used that doesn't take the host into consideration, although this is only speculation.

Is there a way around this problem? Our system admins understandably balked at opening 1000 ports.

Larry
Re: Giraph 1.0.0 - Netty port allocation
The port logic is a bit complex, but it is all encapsulated in NettyServer.java (see below). If nothing else is running on those ports and you really only have one giraph worker per port you should be good to go. Can you look at the logs for the worker that is trying to start a port other than base port + taskId?

    int taskId = conf.getTaskPartition();
    int numTasks = conf.getInt("mapred.map.tasks", 1);
    // Number of workers + 1 for master
    int numServers = conf.getInt(GiraphConstants.MAX_WORKERS, numTasks) + 1;
    int portIncrementConstant =
        (int) Math.pow(10, Math.ceil(Math.log10(numServers)));
    int bindPort = GiraphConstants.IPC_INITIAL_PORT.get(conf) + taskId;
    int bindAttempts = 0;
    final int maxIpcPortBindAttempts = MAX_IPC_PORT_BIND_ATTEMPTS.get(conf);
    final boolean failFirstPortBindingAttempt =
        GiraphConstants.FAIL_FIRST_IPC_PORT_BIND_ATTEMPT.get(conf);

    // Simple handling of port collisions on the same machine while
    // preserving debugability from the port number alone.
    // Round up the max number of workers to the next power of 10 and use
    // it as a constant to increase the port number with.
    while (bindAttempts < maxIpcPortBindAttempts) {
      this.myAddress = new InetSocketAddress(localHostname, bindPort);
      if (failFirstPortBindingAttempt && bindAttempts == 0) {
        if (LOG.isInfoEnabled()) {
          LOG.info("start: Intentionally fail first " +
              "binding attempt as giraph.failFirstIpcPortBindAttempt " +
              "is true, port " + bindPort);
        }
        ++bindAttempts;
        bindPort += portIncrementConstant;
        continue;
      }

      try {
        Channel ch = bootstrap.bind(myAddress);
        accepted.add(ch);
        break;
      } catch (ChannelException e) {
        LOG.warn("start: Likely failed to bind on attempt " +
            bindAttempts + " to port " + bindPort, e);
        ++bindAttempts;
        bindPort += portIncrementConstant;
      }
    }

    if (bindAttempts == maxIpcPortBindAttempts || myAddress == null) {
      throw new IllegalStateException(
          "start: Failed to start NettyServer with " +
              bindAttempts + " attempts");
    }

On 11/22/13 9:15 AM, Larry Compton wrote:

My teammates and I are running Giraph on a cluster where a firewall is configured on each compute node. We had 100 ports opened on the compute nodes, which we thought would be more than enough to accommodate a large number of workers. However, we're unable to go beyond about 90 workers with our Giraph jobs, because Netty ports are being allocated outside of the open range (30000-30100).

We're not sure why this is happening. We shouldn't be running more than one worker per compute node, so we assumed that only port 30000 would be used, but we're routinely seeing Giraph try to use ports greater than 30100 when we request close to 100 workers. This leads us to believe that a simple one-up numbering scheme is being used that doesn't take the host into consideration, although this is only speculation.

Is there a way around this problem? Our system admins understandably balked at opening 1000 ports.

Larry
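To make the arithmetic in the NettyServer snippet easier to follow, here is a small standalone sketch that reproduces only the port computation (first port is base port + task ID, and every retry adds the worker count rounded up to the next power of 10). The class and method names are hypothetical, and the base port of 30000 in the usage is an assumption based on the range discussed in this thread; nothing here binds a socket or calls Giraph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch of the port arithmetic; not Giraph code.
class NettyPortSketch {

  // Mirrors portIncrementConstant: numServers rounded up to a power of 10.
  static int portIncrement(int maxWorkers) {
    int numServers = maxWorkers + 1; // workers + 1 for the master
    return (int) Math.pow(10, Math.ceil(Math.log10(numServers)));
  }

  // Ports a task would try, in order, if every bind attempt failed.
  static List<Integer> candidatePorts(
      int basePort, int taskId, int maxWorkers, int maxAttempts) {
    List<Integer> ports = new ArrayList<>();
    int bindPort = basePort + taskId;
    int increment = portIncrement(maxWorkers);
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      ports.add(bindPort);
      bindPort += increment;
    }
    return ports;
  }

  public static void main(String[] args) {
    // 90 workers: increment is 100, so one retry already leaves 30000-30100.
    System.out.println(candidatePorts(30000, 5, 90, 3));  // [30005, 30105, 30205]
    // 100 workers: 101 servers round up to 1000.
    System.out.println(portIncrement(100));               // 1000
  }
}
```

Under these assumptions, a firewall window of 100 ports only covers the first bind attempt of each task; any retry (including the intentional one when giraph.failFirstIpcPortBindAttempt is true) lands outside it.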
Re: Giraph 1.0.0 - Netty port allocation
Avery,

Thanks for the clarification. I'll look into adding the configuration option. I'll see about providing a patch, if we go down that path.

Larry

On Fri, Nov 22, 2013 at 2:23 PM, Avery Ching ach...@apache.org wrote:

The reason is actually simple. If you run more than one Giraph worker per machine, there will be a port conflict. Worse yet, imagine multiple Giraph jobs running simultaneously on a cluster; hence the increasing-port strategy.

It would be straightforward to add a configurable option to use a single port for situations such as yours, though (especially since you know where the code is now).

Avery

On 11/22/13 11:19 AM, Larry Compton wrote:

Avery,

It looks like the ports are being allocated the way we suspected (30000 + task ID). That's a problem for us because we'll have to open a wide bank of ports (the SAs want to minimize open ports) and also keep them available for use by Giraph.

Ideally, the port allocation would take the host into consideration. If you ask for 200 workers and they're each running on a different host, port 30000 could be used by every Netty server. The way it's working now, a different port is being allocated per worker, which appears unnecessary. Is there a reason a different port is used per worker/task? Is this still the way ports are allocated in Giraph 1.1.0?

Larry

On Fri, Nov 22, 2013 at 1:18 PM, Avery Ching ach...@apache.org wrote:

The port logic is a bit complex, but all encapsulated in NettyServer.java (see below). If nothing else is running on those ports and you really only have one giraph worker per port you should be good to go. Can you look at the logs for the worker that is trying to start a port other than base port + taskId?
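The single-port option discussed in this thread could look roughly like the sketch below. It is purely illustrative: the flag `singlePortPerHost` and the class `SinglePortOptionSketch` are hypothetical names, not an existing Giraph setting, and the sketch is only safe under the assumption stated in the thread, namely that at most one worker runs per host and no other job uses the same port.

```java
// Hypothetical sketch of a "single port" switch; not an existing Giraph option.
class SinglePortOptionSketch {

  // With the flag set, every task starts at the base port; otherwise the
  // behaviour matches the existing scheme (base port + task ID).
  static int initialBindPort(int basePort, int taskId, boolean singlePortPerHost) {
    return singlePortPerHost ? basePort : basePort + taskId;
  }

  public static void main(String[] args) {
    int basePort = 30000; // assumed base port, as in the range discussed above
    System.out.println(initialBindPort(basePort, 57, false)); // 30057
    System.out.println(initialBindPort(basePort, 57, true));  // 30000
  }
}
```

A real patch would also need to decide what a bind retry does when the flag is set, since adding the power-of-10 increment would again leave a narrow firewall window.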