which type of input graph in connected components example

2013-11-22 Thread Silvio Di Gregorio
Hi,
after I launched the connected components example included in the Giraph 1.0
distribution, I noticed some (or rather, many) isolated nodes, that is,
nodes whose node_ID equals the Min_ID (used as the label of the connected
component); yet those isolated nodes are leaves of my directed graph (nodes
with no outgoing edges) and are connected to a sub-graph.
My adjacency list also includes these leaves:
node_id<TAB>neighbor1<TAB>...<TAB>neighborN
leaf1<TAB>(no neighbors listed)

What can I do to include these leaves in their connected components?
I have the base file used to build the adjacency list; it looks like this:
node1  node2
node12 node2
node13 node2
node3   node4

The graph is directed. Could I replicate each edge in the inverse direction,
so that the result is an undirected graph, or is there another option?
Can you help me?
Thanks in advance
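
The Giraph connected components example assumes an undirected (symmetric)
input graph, so one fix is to emit each edge in both directions before
building the adjacency list. A minimal sketch of that edge-replication step,
assuming a two-column, whitespace-separated edge file (the class name and
the input/output file arguments are placeholders, not part of Giraph):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashSet;
import java.util.Set;

public class SymmetrizeEdges {
  public static void main(String[] args) throws IOException {
    // Collect every edge in both directions; the set drops duplicates.
    Set<String> edges = new LinkedHashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] p = line.trim().split("\\s+");
      if (p.length != 2) {
        continue;  // skip blank or malformed lines
      }
      edges.add(p[0] + "\t" + p[1]);  // original direction
      edges.add(p[1] + "\t" + p[0]);  // reverse direction
    }
    in.close();
    PrintWriter out = new PrintWriter(new FileWriter(args[1]));
    for (String e : edges) {
      out.println(e);
    }
    out.close();
  }
}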


Giraph 1.0.0 - Netty port allocation

2013-11-22 Thread Larry Compton
My teammates and I are running Giraph on a cluster where a firewall is
configured on each compute node. We had 100 ports opened on the compute
nodes, which we thought would be more than enough to accommodate a large
number of workers. However, we're unable to go beyond about 90 workers with
our Giraph jobs, due to Netty ports being allocated outside of the range
(30000-30100). We're not sure why this is happening. We shouldn't be
running more than one worker per compute node, so we were assuming that
only port 30000 would be used, but we're routinely seeing Giraph try to use
ports greater than 30100 when we request close to 100 workers. This leads
us to believe that a simple one-up numbering scheme is being used that
doesn't take the host into consideration, although this is only speculation.

Is there a way around this problem? Our system admins understandably balked
at opening 1000 ports.

Larry


Re: Giraph 1.0.0 - Netty port allocation

2013-11-22 Thread Avery Ching
The port logic is a bit complex, but it is all encapsulated in
NettyServer.java (see below).

If nothing else is running on those ports and you really only have one
Giraph worker per port, you should be good to go. Can you look at the
logs for the worker that is trying to start on a port other than base
port + taskId?



int taskId = conf.getTaskPartition();
int numTasks = conf.getInt("mapred.map.tasks", 1);
// Number of workers + 1 for master
int numServers = conf.getInt(GiraphConstants.MAX_WORKERS, numTasks) + 1;
int portIncrementConstant =
    (int) Math.pow(10, Math.ceil(Math.log10(numServers)));
int bindPort = GiraphConstants.IPC_INITIAL_PORT.get(conf) + taskId;
int bindAttempts = 0;
final int maxIpcPortBindAttempts = MAX_IPC_PORT_BIND_ATTEMPTS.get(conf);
final boolean failFirstPortBindingAttempt =
    GiraphConstants.FAIL_FIRST_IPC_PORT_BIND_ATTEMPT.get(conf);

// Simple handling of port collisions on the same machine while
// preserving debugability from the port number alone.
// Round up the max number of workers to the next power of 10 and use
// it as a constant to increase the port number with.
while (bindAttempts < maxIpcPortBindAttempts) {
  this.myAddress = new InetSocketAddress(localHostname, bindPort);
  if (failFirstPortBindingAttempt && bindAttempts == 0) {
    if (LOG.isInfoEnabled()) {
      LOG.info("start: Intentionally fail first " +
          "binding attempt as giraph.failFirstIpcPortBindAttempt " +
          "is true, port " + bindPort);
    }
    ++bindAttempts;
    bindPort += portIncrementConstant;
    continue;
  }

  try {
    Channel ch = bootstrap.bind(myAddress);
    accepted.add(ch);
    break;
  } catch (ChannelException e) {
    LOG.warn("start: Likely failed to bind on attempt " +
        bindAttempts + " to port " + bindPort, e);
    ++bindAttempts;
    bindPort += portIncrementConstant;
  }
}
if (bindAttempts == maxIpcPortBindAttempts || myAddress == null) {
  throw new IllegalStateException(
      "start: Failed to start NettyServer with " +
      bindAttempts + " attempts");
}
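
As a worked example of the arithmetic above (a standalone sketch for
illustration, not code from Giraph): with the default base port of 30000
and roughly 100 workers, numServers is 101, so the collision increment
rounds up to 1000.

int basePort = 30000;             // GiraphConstants.IPC_INITIAL_PORT default
int numServers = 100 + 1;         // workers + 1 for the master
// Round numServers up to the next power of 10: 10^ceil(log10(101)) = 1000.
int increment = (int) Math.pow(10, Math.ceil(Math.log10(numServers)));
int taskId = 95;                  // one of the ~100 map tasks
int firstTry = basePort + taskId; // 30095, inside 30000-30100
int afterOneCollision = firstTry + increment;  // 31095, outside the range

So base port + taskId already reaches 30100 with about 100 tasks, and a
single failed bind pushes a worker a further 1000 ports up; either effect
would land outside a 30000-30100 firewall window.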




Re: Giraph 1.0.0 - Netty port allocation

2013-11-22 Thread Larry Compton
Avery,

Thanks for the clarification. I'll look into adding the configuration
option. I'll see about providing a patch, if we go down that path.

Larry
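
For what it's worth, a hypothetical sketch of what such an option might
look like (the giraph.useSinglePort key is invented here for illustration
and is not an actual Giraph configuration setting):

// Hypothetical flag, not in Giraph 1.0: pin every worker to the base port,
// which is safe only when at most one worker runs per host.
boolean useSinglePort = conf.getBoolean("giraph.useSinglePort", false);
int bindPort = GiraphConstants.IPC_INITIAL_PORT.get(conf) +
    (useSinglePort ? 0 : conf.getTaskPartition());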


On Fri, Nov 22, 2013 at 2:23 PM, Avery Ching ach...@apache.org wrote:

The reason is actually simple. If you run more than one Giraph worker
per machine, there will be a port conflict. Worse yet, imagine multiple
Giraph jobs running simultaneously on a cluster; hence the increasing-port
strategy. It would be straightforward to add a configurable option to use
a single port for situations such as yours, though (especially since you
know where the code is now).

 Avery


 On 11/22/13 11:19 AM, Larry Compton wrote:

  Avery,

It looks like the ports are being allocated the way we suspected (30000 +
task ID). That's a problem for us because we'll have to open a wide bank of
ports (the SAs want to minimize open ports) and also keep them available
for use by Giraph. Ideally, the port allocation would take the host into
consideration. If you ask for 200 workers and they're each running on a
different host, port 30000 could be used by every Netty server. The way
it's working now, a different port is being allocated per worker, which
appears unnecessary. Is there a reason a different port is used per
worker/task?

 Is this still the way ports are allocated in Giraph 1.1.0?

  Larry

