Re: Handling vertices with a huge number of outgoing edges (and other optimizations).

2013-09-25 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello Alok,

About question 3.a: I guess the framework will indeed try to allocate the 
workers locally.
Each worker is actually a map-only task. Due to the behaviour of the Hadoop 
framework, it will aim for data locality. Therefore, the framework will try to 
run the map tasks (and thus the workers) on nodes that have local blocks.

Best regards,
Marco Aurelio Lotz

Sent from my iPhone

On 17 Sep 2013, at 18:19, "kumbh...@usc.edu" <kumbh...@usc.edu> wrote:

Hi,
We have a moderately sized data set (~20M vertices, ~45M edges).

We are running several basic algorithms, e.g. connected components, single 
source shortest paths, PageRank, etc. The setup includes 12 nodes with 8 cores 
and 16GB RAM each. We allow at most three mappers per node (-Xmx5G each) and 
run up to 24 Giraph workers for the experiments. We are using the trunk 
version, pulled on 9/2 from GitHub, on Hadoop 1.2.1. We use HDFS as the data 
store (the file is ~980 MB; with a 64MB block size we get around 15 HDFS 
blocks).

Input data is an adjacency list in JSON format. We use the built-in 
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat as the 
input format.
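
For reference, each input line for this format is a JSON array of the form 
[source_id, source_value, [[dest_id, edge_value], ...]]; e.g. a vertex with 
id 1, value 0.0 and weighted edges to vertices 2 and 3:

[1,0.0,[[2,1.0],[3,2.0]]]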

Given this setup, we have a few questions and would appreciate any help to 
optimize the execution:


  1. We observed that most of the vertices (>90%) have out-degree < 20, and 
some have between 20 and 1,000. However, there are a few vertices (<0.5%) with 
a very high out-degree (>100,000).
     * Due to this, most of the workers load data fairly quickly (~20-30 secs), 
but a couple of workers take much longer (~800 secs) to complete just the data 
input step. Is there a way to handle such vertices? Or do you suggest using any 
other input format?
  2. Another question we have is whether, in general, there is a guide for 
choosing the various optimization parameters?
     * e.g. the number of input/compute/output threads (see the sketch after 
this list)
  3. Data locality and in-memory messages:
     * Is there any data locality attempt while running workers? Basically, out 
of 12 nodes, if the HDFS blocks for a file are stored on only, say, 8 nodes and 
I run 8 workers, is it guaranteed that the workers will run on those 8 nodes?
     * Also, if two vertices are located on the same worker, do we have 
in-memory message transfer between them?
  4. Partitioning: We wish to study the effect of different partitioning 
schemes on the runtime.
     * Is there a Partitioner we can use that will try to collocate neighboring 
vertices on the same worker while balancing the partitions? (Basically a METIS 
partitioner.)
     * If we pre-process the data file and store neighboring vertices close to 
each other in it, so that different HDFS blocks approximately contain 
neighboring vertices, and use the default Giraph partitioner, will that help?
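
As a reference for question 2: the thread counts can be passed as custom 
arguments to GiraphRunner. A hedged sketch, assuming the option names from 
trunk's GiraphConstants (please double-check them against your build):

-ca giraph.numInputThreads=4 \
-ca giraph.numComputeThreads=4 \
-ca giraph.numOutputThreads=4

With 8 cores and up to three mappers per node, a handful of threads per worker 
is a reasonable starting point; beyond that the threads mostly contend for the 
same cores.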


I know this is a long mail, and we truly appreciate your help.

Thanks,
Alok

Sent from Windows Mail



RE: Dynamic Graphs

2013-09-05 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello all,

Answering Mr. Kämpf's question: in my personal opinion this tool would indeed 
be really useful, since many real-world graphs are dynamic.
I have just finished a report on my research on the subject. The report is 
available at:

https://github.com/MarcoLotz/dynamicGraph/blob/master/LotzReport.pdf?raw=true

There is a first application that can do this injection. I am right now 
working on the minor modifications that are proposed in the document; they are 
described in section 2.7.
The previous sections just describe some experiences that I had with Giraph 
and an introduction to the scenario.

Best Regards,
Marco Lotz

From: Mirko Kämpf 
Sent: 25 August 2013 07:55
To: user@giraph.apache.org
Subject: Re: Dynamic Graphs

Good morning Gentlemen,

as far as I understand your thread, you are talking about the same topic I 
have been thinking about and working on for some time.
I work on a research project focused on the evolution of networks and network 
dynamics in networks of networks.

My understanding of Marco's question is that he needs to change node 
properties or even wants to add nodes to the graph while it is processed, right?

With the WorkerContext we could construct a "Connector" to the outside world, 
not just for loading data from HDFS, which also requires a preprocessing step 
for the data to be loaded. I often think about HBase. All my nodes and edges 
live in HBase. From there it is quite easy to load new data based on a simple 
"Scan", or even, if the WorkerContext triggers a Hive or Pig script, one can 
automatically reorganize or extract the relevant new links / nodes which have 
to be added to the graph.

Such an approach means that after n supersteps of the Giraph layer an 
additional utility-step (triggered via the WorkerContext, or any other 
better-fitting Giraph class - I am not sure yet where to start) is executed. 
Before such a step the state of the graph is persisted to allow fallback or 
resume. The utility-step can be a processing (MR, Mahout) or just a load (from 
HDFS, HBase) operation, and it allows a kind of clocked data flow directly 
into a running Giraph application. I think this is a very important feature in 
Complex Systems research, as we have interacting layers which change in 
parallel. In this picture the Giraph steps are the steps of layer A, let's say 
something that's going on on top of a network, and the utility-step expresses 
the changes in the underlying structure, affecting the network itself but 
based on the data / properties of the second subsystem, e.g. the agents 
operating on top of the network.
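
A minimal sketch of such a Connector, assuming the classic HBase client API 
(HTable/Scan) and that WorkerContext exposes getSuperstep(); the class name, 
table name, period and staging logic are placeholders:

import java.io.IOException;

import org.apache.giraph.worker.WorkerContext;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class HBaseConnectorWorkerContext extends WorkerContext {
  // Run the utility-step every N Giraph supersteps (placeholder value).
  private static final int UTILITY_PERIOD = 10;
  private HTable table;

  @Override
  public void preApplication() {
    try {
      // Hypothetical table holding links added by the second subsystem.
      table = new HTable(HBaseConfiguration.create(), "new_edges");
    } catch (IOException e) {
      throw new IllegalStateException("Could not open HBase table", e);
    }
  }

  @Override
  public void preSuperstep() {
    if (getSuperstep() > 0 && getSuperstep() % UTILITY_PERIOD == 0) {
      try {
        // The clocked load: scan for new links and stage them so that an
        // injector vertex can add them to the graph via the mutation API.
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result row : scanner) {
          // stage 'row' for injection (application-specific)
        }
        scanner.close();
      } catch (IOException e) {
        throw new IllegalStateException("Utility-step scan failed", e);
      }
    }
  }

  @Override
  public void postSuperstep() { }

  @Override
  public void postApplication() {
    try {
      table.close();
    } catch (IOException e) {
      // best effort on shutdown
    }
  }
}

preSuperstep() runs on every worker before each superstep, which makes it a 
natural hook for such a clocked utility-step; persisting the graph state 
beforehand would still have to be handled separately (e.g. via Giraph's 
checkpointing).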

I created a tool which worked like this - but not at scale - and it was at a 
time before Giraph. What do you think: is there a need for such a kind of 
extension in the Giraph world?

Have a nice Sunday.

Best wishes
Mirko

--
--
Mirko Kämpf

Trainer @ Cloudera

tel: +49 176 20 63 51 99
skype: kamir1604
mi...@cloudera.com



On Wed, Aug 21, 2013 at 3:30 PM, Claudio Martella 
<claudio.marte...@gmail.com> wrote:
As I said, the injection of the new vertices/edges would have to be done 
"manually", hence without any support from the infrastructure. I'd suggest you 
implement a WorkerContext class that supports the reading of a specific file 
with a specific format (under your control) from HDFS, and that is accessed by 
this particular "special" vertex (e.g. based on the vertex ID).

Does this make sense?
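
A minimal sketch of the injector idea, assuming the trunk Computation API and 
its mutation requests (addVertexRequest/addEdgeRequest); the class name, HDFS 
path, line format and value types below are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class InjectorComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  // The "special" vertex that performs the injection (placeholder id).
  private static final long INJECTOR_ID = 0L;

  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (vertex.getId().get() == INJECTOR_ID) {
      // Hypothetical per-superstep update file: one "src dst weight" per line.
      Path updates = new Path("/updates/superstep-" + getSuperstep() + ".txt");
      FileSystem fs = FileSystem.get(getConf());
      if (fs.exists(updates)) {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(updates)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\\s+");
          LongWritable src = new LongWritable(Long.parseLong(parts[0]));
          LongWritable dst = new LongWritable(Long.parseLong(parts[1]));
          FloatWritable weight = new FloatWritable(Float.parseFloat(parts[2]));
          // Request the mutations; they are resolved before the next superstep.
          addVertexRequest(dst, new DoubleWritable(0));
          addEdgeRequest(src, EdgeFactory.create(dst, weight));
        }
        in.close();
      }
    } else {
      // The regular algorithm logic for ordinary vertices goes here;
      // only they vote to halt, so the injector stays active.
      vertex.voteToHalt();
    }
  }
}

Mutation requests issued during superstep S are resolved before superstep S+1, 
so the injected vertices and edges become visible one superstep later.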


On Wed, Aug 21, 2013 at 2:13 PM, Marco Aurelio Barbosa Fagnani Lotz 
<m.a.b.l...@stu12.qmul.ac.uk> wrote:
Dear Mr. Martella,

Once the conditions for updating the vertex data base are achieved, what is 
the best way for the Injector Vertex to call an input reader again?

I am able to access all the HDFS data, but I guess the vertex would need to 
have access to the input splits and also to the vertex input format that I 
designate. Am I correct? Or is there a way that one can just ask ZooKeeper to 
create new splits and distribute them to the workers given a path in DFS?

Best Regards,
Marco Lotz

From: Claudio Martella <claudio.marte...@gmail.com>
Sent: 14 August 2013 15:25
To: user@giraph.apache.org
Subject: Re: Dynamic Graphs

Hi Marco,

Giraph currently does not support that. One way of doing this would be by 
having a specific (pseudo-)vertex act as the "injector" of the new vertices 
and edges. For example, it would read a file from HDFS and call the mutation 
API during the computation, superstep after superstep.


On Wed, Aug 14, 2013 at 3:02 PM, Marco Aurelio Barbosa Fagnani Lotz 
<m.a.b.l...@stu12.qmul.ac.uk> wrote:
Hello all,

I would like to know if there is any way to use dynamic graphs with Giraph. By 
dynamic one can read graphs that may change while Giraph is 
computing/deliberating. The changes are in the input file and are not caused 
by the graph computation itself.

RE: Dynamic Graphs

2013-08-21 Thread Marco Aurelio Barbosa Fagnani Lotz
Dear Mr. Martella,

Once the conditions for updating the vertex data base are achieved, what is 
the best way for the Injector Vertex to call an input reader again?

I am able to access all the HDFS data, but I guess the vertex would need to 
have access to the input splits and also to the vertex input format that I 
designate. Am I correct? Or is there a way that one can just ask ZooKeeper to 
create new splits and distribute them to the workers given a path in DFS?

Best Regards,
Marco Lotz

From: Claudio Martella 
Sent: 14 August 2013 15:25
To: user@giraph.apache.org
Subject: Re: Dynamic Graphs

Hi Marco,

Giraph currently does not support that. One way of doing this would be by 
having a specific (pseudo-)vertex act as the "injector" of the new vertices 
and edges. For example, it would read a file from HDFS and call the mutation 
API during the computation, superstep after superstep.


On Wed, Aug 14, 2013 at 3:02 PM, Marco Aurelio Barbosa Fagnani Lotz 
<m.a.b.l...@stu12.qmul.ac.uk> wrote:
Hello all,

I would like to know if there is any way to use dynamic graphs with Giraph. By 
dynamic one can read graphs that may change while Giraph is 
computing/deliberating. The changes are in the input file and are not caused 
by the graph computation itself.

Is there any way to analyse it using Giraph? If not, does anyone have any 
idea/suggestion on whether it is possible to modify the framework in order to 
process it?

Best Regards,
Marco Lotz



--
   Claudio Martella
   claudio.marte...@gmail.com


New vertex allocation and messages

2013-08-19 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello all :)

I am programming an application that has to create and destroy a few vertices. 
I was wondering if there is any protection in Giraph to prevent a vertex from 
sending a message to another vertex that does not exist (i.e. providing a 
vertex id that is not associated with a vertex yet).

Is there a way to test if the destination vertex exists before sending the 
message to it?

Also, when a vertex is created, is there any sort of load balancing, or is it 
always kept in the worker that created it?

Best Regards,
Marco Lotz




RE: Workers input splits and MasterCompute communication

2013-08-19 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello all :)

I am having problems calling getContext().getInputSplit() inside the compute() 
method in the workers.

It always returns as if it didn't get any split at all: 
inputSplit.getLocations() returns without the hosts that should have that 
split as local, and inputSplit.getLength() returns 0.

Should there be any initialization of the workers' context so that I can get 
this information?
Is there any way to access the jobContext from the workers or the Master?

Best Regards,
Marco Lotz


From: Marco Aurelio Barbosa Fagnani Lotz 
Sent: 17 August 2013 20:20
To: user@giraph.apache.org
Subject: Workers input splits and MasterCompute communication

Hello all :)

In which class do the workers actually get the input file splits from the file 
system?

Is it possible for a MasterCompute class object to have access to / 
communication with the workers in that job? I thought about using aggregators, 
but then I assumed that aggregators actually work with the vertices' compute() 
(and related methods) and not with the worker itself.

By workers I don't mean the vertices in each worker, but the object that runs 
the compute for all the vertices in that worker.
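
For what it's worth, a hedged sketch of worker/master communication via 
aggregators, assuming the trunk API in which WorkerContext exposes the 
worker-side aggregate() method; the aggregator name and class names are 
illustrative:

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.giraph.worker.WorkerContext;
import org.apache.hadoop.io.LongWritable;

// Master side: register the aggregator before superstep 0 and read it back.
public class StatsMasterCompute extends DefaultMasterCompute {
  @Override
  public void initialize() throws InstantiationException,
      IllegalAccessException {
    registerAggregator("worker.stats", LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // Sum of the values the workers aggregated in the previous superstep.
    LongWritable stats = getAggregatedValue("worker.stats");
  }
}

// Worker side: aggregate from the worker object itself, not from a vertex.
public class StatsWorkerContext extends WorkerContext {
  @Override
  public void preApplication() { }

  @Override
  public void postApplication() { }

  @Override
  public void preSuperstep() { }

  @Override
  public void postSuperstep() {
    // e.g. report one unit per worker per superstep.
    aggregate("worker.stats", new LongWritable(1));
  }
}

The master sees the value in the superstep after the workers wrote it, and 
regular (non-persistent) aggregators are reset between supersteps.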

Best Regards,
Marco Lotz


Workers input splits and MasterCompute communication

2013-08-17 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello all :)

In which class do the workers actually get the input file splits from the file 
system?

Is it possible for a MasterCompute class object to have access to / 
communication with the workers in that job? I thought about using aggregators, 
but then I assumed that aggregators actually work with the vertices' compute() 
(and related methods) and not with the worker itself.

By workers I don't mean the vertices in each worker, but the object that runs 
the compute for all the vertices in that worker.

Best Regards,
Marco Lotz


RE: Using Giraph at Facebook

2013-08-15 Thread Marco Aurelio Barbosa Fagnani Lotz
Great article! :)
Really interesting to see how you solved so many problems.

Btw, there's a small typo: "in that help a variety of teams withchallenges 
they couldn’t solve with Hive/Hadoop or other custom systems in thepast."

Cheers,
Marco

From: Avery Ching 
Sent: 15 August 2013 00:55
To: user@giraph.apache.org
Subject: Using Giraph at Facebook

Hi Giraphers,

We recently released an article on how we use Giraph at the scale of a
trillion edges at Facebook.  If you're interested, please take a look!

https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

Avery

Dynamic Graphs

2013-08-14 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello all,

I would like to know if there is any way to use dynamic graphs with Giraph. By 
dynamic one can read graphs that may change while Giraph is 
computing/deliberating. The changes are in the input file and are not caused 
by the graph computation itself.

Is there any way to analyse it using Giraph? If not, does anyone have any 
idea/suggestion on whether it is possible to modify the framework in order to 
process it?

Best Regards,
Marco Lotz


RE: Logger output

2013-08-12 Thread Marco Aurelio Barbosa Fagnani Lotz
Never mind,

just solved the problem. The solution was just as Ashish described. I was 
having an error in a script that I was running, so I wasn't able to find the 
log.

Cheers,
Marco

From: Marco Aurelio Barbosa Fagnani Lotz 
Sent: 12 August 2013 13:20
To: user@giraph.apache.org
Subject: RE: Logger output

Thanks Ashish :)

I took a look in the directory HADOOP_BASE_PATH/logs/userlogs/job_number, but 
in the syslog there are no indications of these logs. Right now I am running 
Giraph in pseudo-distributed mode, so it should be on this machine.

I even tried to change from LOG.debug("") to LOG.info("") to see if it appears 
in the logs, and it still didn't work. Am I missing something? Should I 
somehow initialize the LOG by a different method than just declaring it with

"private static final Logger LOG =
  Logger.getLogger(SimpleBFSComputation.class);"?

I am trying to log right now with:

"LOG.info("testinglog");"

Best Regards,
Marco Lotz

From: Ashish Jain 
Sent: 09 August 2013 18:48
To: user@giraph.apache.org
Subject: Re: Logger output

Hello Marco,

In my experiments, I have found the log output to be in the Hadoop log file of 
the application. When you run your application, note down the job number. The 
Hadoop log file is usually in HADOOP_BASE_PATH/logs/userlogs/job_number. In it 
you need to look at syslog; among the various interleaved lines will be the 
output of LOG.

If you run your program on a cluster, you might have to find out on which node 
the program ran. One way, if you use -op in your application, is to look at 
the log to see the cluster node name. The other way is to just check 
HADOOP_BASE_PATH/logs/userlogs/job_number on all the nodes of your cluster. 
You will find output from the MasterThread and from one or more worker threads.

This is the approach I have used; there might be a better way to do this. Hope 
this helps.

Ashish


On Fri, Aug 9, 2013 at 4:43 AM, Marco Aurelio Barbosa Fagnani Lotz 
<m.a.b.l...@stu12.qmul.ac.uk> wrote:
Hello there! :)

I am writing a Giraph application but I could not find the output location of 
the logs.
Where is the default output path to see the logged info?

By log I mean the log that is inside a class that one creates:

private static final Logger LOG =
  Logger.getLogger(SimpleBFSComputation.class);

I call the following method to enable that log for debug:
LOG.setLevel(Level.DEBUG);

And then write some random content in it:
if (LOG.isDebugEnabled()) {
  LOG.debug("This is a logged line");
}


Just to clarify: if I called "LOG.setLevel(Level.DEBUG);", I am enabling the 
log for debug, and then the method isDebugEnabled() will return true, correct?
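
For reference, an alternative to calling setLevel in code is a 
log4j.properties entry; a hedged sketch, assuming Log4j 1.x (which Hadoop 
uses) and a hypothetical package name for the class:

log4j.logger.org.example.SimpleBFSComputation=DEBUG

The entry has to end up in the log4j configuration that the task JVMs actually 
load, e.g. the one shipped in the Hadoop conf directory.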

Best Regards,
Marco Lotz



RE: Logger output

2013-08-12 Thread Marco Aurelio Barbosa Fagnani Lotz
Thanks Ashish :)

I took a look in the directory HADOOP_BASE_PATH/logs/userlogs/job_number, but 
in the syslog there are no indications of these logs. Right now I am running 
Giraph in pseudo-distributed mode, so it should be on this machine.

I even tried to change from LOG.debug("") to LOG.info("") to see if it appears 
in the logs, and it still didn't work. Am I missing something? Should I 
somehow initialize the LOG by a different method than just declaring it with

"private static final Logger LOG =
  Logger.getLogger(SimpleBFSComputation.class);"?

I am trying to log right now with:

"LOG.info("testinglog");"

Best Regards,
Marco Lotz

From: Ashish Jain 
Sent: 09 August 2013 18:48
To: user@giraph.apache.org
Subject: Re: Logger output

Hello Marco,

In my experiments, I have found the log output to be in the Hadoop log file of 
the application. When you run your application, note down the job number. The 
Hadoop log file is usually in HADOOP_BASE_PATH/logs/userlogs/job_number. In it 
you need to look at syslog; among the various interleaved lines will be the 
output of LOG.

If you run your program on a cluster, you might have to find out on which node 
the program ran. One way, if you use -op in your application, is to look at 
the log to see the cluster node name. The other way is to just check 
HADOOP_BASE_PATH/logs/userlogs/job_number on all the nodes of your cluster. 
You will find output from the MasterThread and from one or more worker threads.

This is the approach I have used; there might be a better way to do this. Hope 
this helps.

Ashish


On Fri, Aug 9, 2013 at 4:43 AM, Marco Aurelio Barbosa Fagnani Lotz 
<m.a.b.l...@stu12.qmul.ac.uk> wrote:
Hello there! :)

I am writing a Giraph application but I could not find the output location of 
the logs.
Where is the default output path to see the logged info?

By log I mean the log that is inside a class that one creates:

private static final Logger LOG =
  Logger.getLogger(SimpleBFSComputation.class);

I call the following method to enable that log for debug:
LOG.setLevel(Level.DEBUG);

And then write some random content in it:
if (LOG.isDebugEnabled()) {
  LOG.debug("This is a logged line");
}


Just to clarify: if I called "LOG.setLevel(Level.DEBUG);", I am enabling the 
log for debug, and then the method isDebugEnabled() will return true, correct?

Best Regards,
Marco Lotz



Logger output

2013-08-09 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello there! :)

I am writing a Giraph application but I could not find the output location of 
the logs.
Where is the default output path to see the logged info?

By log I mean the log that is inside a class that one creates:

private static final Logger LOG =
  Logger.getLogger(SimpleBFSComputation.class);

I call the following method to enable that log for debug:
LOG.setLevel(Level.DEBUG);

And then write some random content in it:
if (LOG.isDebugEnabled()) {
  LOG.debug("This is a logged line");
}


Just to clarify: if I called "LOG.setLevel(Level.DEBUG);", I am enabling the 
log for debug, and then the method isDebugEnabled() will return true, correct?

Best Regards,
Marco Lotz


Zookeeper problem when running in Pure Yarn

2013-07-20 Thread Marco Aurelio Barbosa Fagnani Lotz
Hello :)

When I run the SimplePageRankComputation example, using the following cmd line:

hadoop jar 
giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.0.3-alpha-jar-with-dependencies.jar 
org.apache.giraph.GiraphRunner 
org.apache.giraph.examples.SimplePageRankComputation -vif 
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip 
/input/input.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat 
-op /outPageRank -w 1 -mc 
org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute

I am getting the following error:

"13/07/20 18:37:57 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
13/07/20 18:37:58 INFO utils.ConfigurationUtils: No edge input format 
specified. Ensure your InputFormat does not require one.
13/07/20 18:37:58 INFO yarn.GiraphYarnClient: Final output path is: 
hdfs://localhost:9000/outPageRank
13/07/20 18:37:58 INFO service.AbstractService: 
Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
Exception in thread "main" java.lang.IllegalArgumentException: Giraph on YARN 
does not currentlysupport Giraph-managed ZK instances: use a standalone 
ZooKeeper: 'null'
at 
org.apache.giraph.yarn.GiraphYarnClient.checkJobLocalZooKeeperSupported(GiraphYarnClient.java:392)
at org.apache.giraph.yarn.GiraphYarnClient.run(GiraphYarnClient.java:106)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:96)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:126)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)"


I couldn't find anything about this ZooKeeper error on Google; any hints?
Since the PageRankBenchmark is not working on YARN, which examples actually 
work on the current versions?
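
For reference, the exception itself points at the fix: Giraph on YARN expects 
an externally managed ZooKeeper. A hedged sketch, assuming a standalone 
ZooKeeper on localhost:2181 and the giraph.zkList option name from 
GiraphConstants:

hadoop jar giraph-examples-...jar org.apache.giraph.GiraphRunner \
  -D giraph.zkList=localhost:2181 \
  ... (remaining arguments as above)

Since GiraphRunner runs through ToolRunner, the -D generic option should be 
picked up; -ca giraph.zkList=localhost:2181 should work as well.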

Oh, another thing: does anyone know how to remove the:

"13/07/20 18:37:57 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable"

on Hadoop YARN 2.0.3-alpha?

Best regards,
Marco