Re: CPU utilization in each superstep

2015-04-05 Thread Matthew Saltz
I don't remember all the info it gives you, but check out the
giraph.metrics.enable command-line flag.
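
For example, appended to a normal GiraphRunner invocation (a sketch; the jar
and class names are placeholders for your own):

hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner my.app.MyComputation \
  ... \
  -ca giraph.metrics.enable=true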

Best,
Matthew

On Sunday, April 5, 2015, Ravikant Dindokar  wrote:

> Hi
>
> I am a newbie learning Giraph on hadoop 2.2.0. I want to find out the CPU
> utilization as well as the time spent sending messages in each superstep.
>
> I am not familiar with the Giraph code. Can someone point out the functions
> I should look into to add logging for this information?
>
> Thanks
> Ravikant
>
>


Re: Undirected Vertex Definition and Reflexivity

2015-03-11 Thread Matthew Saltz
Hi,

I believe the answer to your question is yes, though I've never done it. If
you use only the edge reader, only the vertices in your graph that have at
least one edge attached to them will be present. So, if you have vertices
that are entirely disconnected that you want included, you'd need to use
both a VertexReader and an EdgeReader. If you don't have disconnected
vertices, you don't need the VertexReader, because Giraph will automatically
add all of the vertices in your edge file to the graph (I think this can be
disabled). Use the -eip flag to specify the edge file.
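
As a sketch, the combined command line might look like this (the input format
classes are just illustrative; pick ones matching your id and value types):

hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner my.app.MyComputation \
  -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip vertices.txt \
  -eif org.apache.giraph.io.formats.IntNullReverseTextEdgeInputFormat -eip edges.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op output -w 4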

Best,
Matthew Saltz

On Wed, Mar 11, 2015 at 1:54 AM, G.W.  wrote:

> Thanks for that!
>
> This is the right idea, however I was only using a VertexReader until now
> – IntNullReverseTextEdgeInputFormat calls for an EdgeReader.
>
> I am not sure this is the way it works but I like the idea of segregating
> edge and vertex definitions.
>
> *That leads to the following question: can Giraph support the use of a
> VertexReader and an EdgeReader at the same time, that is, through the -vif
> and -eif arguments?*
>
> If that works, the idea would be to refactor my input files with:
>
> Vertices:
> vertex_id, vertex_type, ...
>
> Edges
> source_id, target_id
>
> with the edge reader working in "reverse" mode.
>
> Thanks!
>
>
>
>
> On 10 March 2015 at 20:02, Matthew Saltz  wrote:
>
>> Have a look at IntNullReverseTextEdgeInputFormat
>> <https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/IntNullReverseTextEdgeInputFormat.html>.
>> It automatically creates reverse edges, but it expects the file format
>>
>> source_id target_id
>>
>> on each line. If you need to convert it to use longs you can change the
>> code pretty easily.
>>
>> Best,
>> Matthew
>>
>> On Tue, Mar 10, 2015 at 5:37 AM, Young Han 
>> wrote:
>>
>>> The input is assumed to be the vertex followed by a set of *directed*
>>> edges. So, in your example, leaving out E2 means that the final graph will
>>> not have the directed edge from V2 to V1. To get an undirected edge, you
>>> need a pair of directed edges.
>>>
>>> Internally, Giraph stores the out-edges of each vertex as an adjacency
>>> list at that vertex. So, for example, your undirected graph becomes a
>>> vertex object V1 with an adjacency list {V2} and a vertex object V2 with an
>>> adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
>>> list {V2} and a vertex V2 with an empty adjacency list {}.
>>>
>>> There's no easy way for Giraph to infer that V2's adjacency list should
>>> contain V1, because V2 does not track its in-edges. To get around this, you
>>> can either (1) use an undirected input file with both pairs of edges
>>> present; (2) have, in your algorithms, all vertices broadcast their ids to
>>> their out-edge neighbours and then perform mutations to add the missing
>>> edges in the first two supersteps; or (3) modify the code in
>>> org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
>>> (i.e., add a new "type" of input format). I'm fairly certain that there
>>> doesn't already exist an "assume undirected graph" input reader, but I'm
>>> not too familiar with the code paths and options there so I could be wrong.
>>>
>>> Young
>>>
>>> On Mon, Mar 9, 2015 at 11:54 PM, G.W.  wrote:
>>>
>>>> Hi Giraph Mailing List,
>>>>
>>>> I am writing about an undirected graph I am trying to move to Giraph. I
>>>> have a question about the assumption Giraph makes when processing an input.
>>>>
>>>> Let V1 and V2 be two vertices connected by a common edge. E1 defines an
>>>> edge from V1 to V2. E2 defines an edge from V2 to V1.
>>>>
>>>> Simply put, these are defined in an input file as:
>>>> V1, E1
>>>> V2, E2
>>>>
>>>> This is working fine, I can process the graph accordingly.
>>>>
>>>> I was trying to see what would happen if I was to simplify the input
>>>> file:
>>>> V1, E1
>>>> V2
>>>>
>>>> When the time comes for V2 to be processed in a superstep, Giraph
>>>> does not suggest E1 as an outgoing edge. My question is why, knowing
>>>> that E1 defines an edge from V1 to V2. The graph being undirected (although
>>>> there is no provision for that in my Giraph computation), shouldn't Giraph
>>>> assume that V2 is connected to V1?
>>>>
>>>> Down the road, the idea would be to streamline the input format, hence
>>>> my question.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>
>


Re: Undirected Vertex Definition and Reflexivity

2015-03-10 Thread Matthew Saltz
Have a look at IntNullReverseTextEdgeInputFormat
<https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/IntNullReverseTextEdgeInputFormat.html>.
It automatically creates reverse edges, but it expects the file format

source_id target_id

on each line. If you need to convert it to use longs you can change the
code pretty easily.
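
If instead you want to keep the simplified input and patch the graph up at
runtime (Young's option (2) below), a minimal, untested sketch might look
like this (the Int/Null types are assumptions):

import java.io.IOException;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class AddReverseEdges extends BasicComputation<
    IntWritable, NullWritable, NullWritable, IntWritable> {
  @Override
  public void compute(Vertex<IntWritable, NullWritable, NullWritable> vertex,
      Iterable<IntWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Every vertex announces its id to its out-neighbours.
      sendMessageToAllEdges(vertex, vertex.getId());
    } else if (getSuperstep() == 1) {
      // Add a reverse edge for any in-neighbour we don't already point to.
      for (IntWritable source : messages) {
        if (vertex.getEdgeValue(source) == null) {
          vertex.addEdge(EdgeFactory.create(
              new IntWritable(source.get()), NullWritable.get()));
        }
      }
    }
    // ... the real algorithm would start at superstep 2 ...
  }
}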

Best,
Matthew

On Tue, Mar 10, 2015 at 5:37 AM, Young Han  wrote:

> The input is assumed to be the vertex followed by a set of *directed*
> edges. So, in your example, leaving out E2 means that the final graph will
> not have the directed edge from V2 to V1. To get an undirected edge, you
> need a pair of directed edges.
>
> Internally, Giraph stores the out-edges of each vertex as an adjacency
> list at that vertex. So, for example, your undirected graph becomes a
> vertex object V1 with an adjacency list {V2} and a vertex object V2 with an
> adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
> list {V2} and a vertex V2 with an empty adjacency list {}.
>
> There's no easy way for Giraph to infer that V2's adjacency list should
> contain V1, because V2 does not track its in-edges. To get around this, you
> can either (1) use an undirected input file with both pairs of edges
> present; (2) have, in your algorithms, all vertices broadcast their ids to
> their out-edge neighbours and then perform mutations to add the missing
> edges in the first two supersteps; or (3) modify the code in
> org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
> (i.e., add a new "type" of input format). I'm fairly certain that there
> doesn't already exist an "assume undirected graph" input reader, but I'm
> not too familiar with the code paths and options there so I could be wrong.
>
> Young
>
> On Mon, Mar 9, 2015 at 11:54 PM, G.W.  wrote:
>
>> Hi Giraph Mailing List,
>>
>> I am writing about an undirected graph I am trying to move to Giraph. I
>> have a question about the assumption Giraph makes when processing an input.
>>
>> Let V1 and V2 be two vertices connected by a common edge. E1 defines an
>> edge from V1 to V2. E2 defines an edge from V2 to V1.
>>
>> Simply put, these are defined in an input file as:
>> V1, E1
>> V2, E2
>>
>> This is working fine, I can process the graph accordingly.
>>
>> I was trying to see what would happen if I was to simplify the input
>> file:
>> V1, E1
>> V2
>>
>> When the time comes for V2 to be processed in a superstep, Giraph
>> does not suggest E1 as an outgoing edge. My question is why, knowing
>> that E1 defines an edge from V1 to V2. The graph being undirected (although
>> there is no provision for that in my Giraph computation), shouldn't Giraph
>> assume that V2 is connected to V1?
>>
>> Down the road, the idea would be to streamline the input format, hence my
>> question.
>>
>> Thanks!
>>
>>
>>
>


Re: Giraph Partitioning

2015-02-25 Thread Matthew Saltz
Hi,

1) The partitions are processed in parallel based on the number of threads
you specify; the vertices within a partition are processed sequentially.
You may want to use more partitions than threads, so that if one partition
takes particularly long to process, the other threads can continue with the
remaining partitions. For example, if you have four machines with 12 threads
each and one worker per machine, the default number of partitions will be
4^2 = 16, whereas you actually have 48 threads available, so you'd probably
want to set the number of partitions manually to a larger value to take
advantage of the parallelism (see the sketch below).
2) Yes
3) If you are only doing single threading, there's no reason to do multiple
partitions per worker
3 (the second one)) I'm not familiar with the out-of-core functionality
4) I'm not sure

I'm basing this on the version of Giraph from this summer, not the most
recent release, but I don't think this part has changed. You may want to
verify by looking at the code.
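
For example, something along these lines (the numbers follow the example in
1: 4 workers with 12 threads each and 3 partitions per thread, so 3*12*4 =
144):

-ca giraph.numComputeThreads=12 \
-ca giraph.userPartitionCount=144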

Best,
Matthew

On Wed, Feb 25, 2015 at 3:25 AM, Arjun Sharma  wrote:

> Hi,
>
> I understand that by default, the number of partitions = number of workers
> ^ 2. So, if we have N workers, each worker will process N partitions. I
> have a number of questions:
>
> 1- By default, does Giraph process the N partitions within a single worker
> sequentially? If yes, when setting the parameter giraph.numComputeThreads,
> will partitions within each thread be computed sequentially?
>
> 2- By default, does Giraph keep all partitions in memory?
>
> 3- If the answers to 1 and 2 are yes and yes, is there any advantage from
> using multiple partitions versus a single partition in the case of single
> threading per worker?
>
> 3- How does the out-of-core partitions affect out-of-core messages? Are
> they completely independent? For example, if the number of partitions to be
> kept in memory is set to a number less than N, and at the same time all
> messages are set to be kept in memory, will ALL messages be kept in memory,
> even those from out-of-core partitions? If the situation is reversed, where
> all partitions are kept in memory, and out-of-core messaging is set, will
> messages from memory-based partitions be saved on disk?
>
> 4- Is there a class like a PartitionContext, where you can access
> preSuperstep and postSuperstep *per partition*, along the lines of
> WorkerContext?
>
>


Re: Best way to know the assignment of vertices to workers

2014-11-28 Thread Matthew Saltz
Kiran,

To answer your question directly, in an AbstractComputation class (or
whatever descendant you're using), you may call
getWorkerContext().getMyWorkerIndex().
However, if each vertex has metadata associated with it, I think the best
way to go would be to define a custom VertexReader and a custom Vertex type
that take the metadata into account when the vertex is read.
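
A minimal, untested sketch of the first approach (the id/value types are
placeholders for whatever your graph uses):

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class WorkerAwareComputation extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Which worker is executing this vertex.
    int myWorker = getWorkerContext().getMyWorkerIndex();
    // e.g. use myWorker to decide which shard of the metadata file to consult.
    vertex.voteToHalt();
  }
}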

Best,
Matthew

On Fri, Nov 28, 2014 at 1:02 PM, Garimella Kiran 
wrote:

>  Hi all,
>
>  Is there a clean way to find out which worker a particular vertex is
> assigned to?
>
>  From what I tried out, I found that given n workers, each node is
> assigned to the worker with id (vertex_id % n). Is that a safe way to do
> this?
>
>  I’ve had a look at previous discussions, but most of them have no answer.
>
>  —
>
>  Why I need it:
>
>  In my application, each vertex needs to know some additional meta data,
> which is loaded from file. This metadata file is huge (>50 G) and so, on
> each worker, I only want to load the metadata corresponding to the vertices
> present on that worker.
>
>  —
>
>
>  Previous discussions:
> 1.
> http://mail-archives.apache.org/mod_mbox/giraph-user/201310.mbox/%3C7EC16F82718A6D4A920A99FE46CE7F4E2861F779%40MERCMBX19R.na.SAS.com%3E
> 2.
> http://mail-archives.apache.org/mod_mbox/giraph-user/201403.mbox/%3CCAMf08QYE%2BRgUv9otXT6oPJorTNjQ-Ay8p4NUiuhds8%2BzgDzs1w%40mail.gmail.com%3E
>
>
>
>  Regards,
> Kiran
>


Re: When do Giraph vertices receive their messages?

2014-11-10 Thread Matthew Saltz
Hi Vincentius,

I'd recommend checking out the code in the call() method of this class
<https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/graph/ComputeCallable.java>
to
try to follow the logic that occurs during computation in a superstep, as
well as the code
<https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyWorkerClientRequestProcessor.java>
for handling message sending and the execute method in GraphTaskManager
<https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/graph/GraphTaskManager.java>
which
basically handles the overall control flow of everything. I've found that
for Giraph at some point you're going to more or less need to dig through
the code to figure out what's going on behind the scenes.  Looking at the
call() and computePartition() methods in ComputeCallable is pretty
enlightening.  As far as messaging goes, it appears that everything is
flushed from the sender before the end of the superstep.

Someone else please correct me if I'm wrong about any of these things; I
don't want to mislead anyone.

Best,
Matthew Saltz



On Mon, Nov 10, 2014 at 12:18 PM, Vincentius Martin <
vincentiusmar...@gmail.com> wrote:

> Hi XingFeng, thanks for your answer!
>
> Yes, I have already read Pregel paper, unfortunately there are some
> specific steps that I still couldn't grasp.
>
> Therefore, when does the checkpoint happen? Is it before/after the step 1
> (the receiving messages phase) in your explanation?
>
> Also, according to your explanation, I can deduce that at the beginning of
> each superstep, the messages are still in the sender workers' buffer and
> each of the sender workers will send them at this phase. Am I right?
>
>
> Regards,
> Vincentius Martin
>
> On Mon, Nov 10, 2014 at 5:49 PM, XingFENG 
> wrote:
>
>> Hi Vincentius Martin,
>>
>> Since Giraph is based on Pregel, I would refer you to the paper *Pregel:
>> A System for Large-Scale Graph Processing *for more details.
>>
>> Briefly speaking, in each superstep,
>> 1. a worker (which is responsible for a partition of vertices) receives
>> messages from others. The worker then divides these messages according to
>> their destination ids and activates vertices that have incoming messages.
>> 2. a worker runs the *compute* function of each active vertex. Meanwhile,
>> the *compute* function may generate messages to other vertices. These
>> messages are buffered, combined and sent in batches in an asynchronous way.
>> 3. after a worker finishes the *compute* function of all active vertices,
>> it waits for all other workers to finish their *compute* functions. What
>> is more, it waits for all sending tasks to finish, to ensure all messages
>> can be received in the next superstep. Then every worker goes into the
>> next superstep.
>>
>> For your second problem, messages are stored in a buffer.
>>
>> On Mon, Nov 10, 2014 at 6:14 PM, Puneet Agarwal 
>> wrote:
>>
>>> These are some very interesting questions. I also would like to know the
>>> answers to these.
>>>
>>> - Puneet
>>> IIT Delhi, India
>>>
>>>
>>>   On Monday, November 10, 2014 9:30 AM, Vincentius Martin <
>>> vincentiusmar...@gmail.com> wrote:
>>>
>>>
>>> I am curious about how Giraph receives messages before processing them.
>>> I know that they use their accepted messages in the compute() method on
>>> the next superstep, but when do they receive it? If it is before the
>>> checkpoint process, is there any part in the documentation/code that I can
>>> see to understand it?
>>> Also, what mechanism does Giraph use to store messages before superstep
>>> S+1? Does it store them in a buffer or on disk first?
>>> I still cannot find anything about this.
>>>
>>> Regards,
>>> Vincentius Martin
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards.
>> ---
>> Xing FENG
>> PhD Candidate
>> Database Research Group
>>
>> School of Computer Science and Engineering
>> University of New South Wales
>> NSW 2052, Sydney
>>
>> Phone: (+61) 413 857 288
>>
>
>


Re: Issue in Aggregator

2014-11-08 Thread Matthew Saltz
Hi Puneet,

It's unclear to me what you want in terms of aggregator behavior. Are you
saying you want an aggregator such that the final output is the aggregated
value for just a particular worker? With an aggregator you should at least
make sure the operations you're performing are commutative; that is, the
order in which items are aggregated should not matter unless that order is
explicitly dealt with somehow. Otherwise you'll get unpredictable results.
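
As a sketch of what commutativity looks like in practice (untested; it
assumes a Text value, like your MyAgg):

import org.apache.giraph.aggregators.BasicAggregator;
import org.apache.hadoop.io.Text;

public class MaxTextAggregator extends BasicAggregator<Text> {
  @Override
  public void aggregate(Text value) {
    // Keep the larger value. Max is commutative, so the order in which the
    // workers' partial results are merged cannot change the outcome.
    if (value.compareTo(getAggregatedValue()) > 0) {
      setAggregatedValue(new Text(value));
    }
  }

  @Override
  public Text createInitialValue() {
    return new Text();  // the empty string is the identity for this max
  }
}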

Best,
Matthew Saltz
On 08/11/2014 15:05, "Puneet Agarwal"  wrote:

> Hi All,
> In my algo, I use an Aggregator which takes a Text value. I have written
> my custom aggregator class for this, as given below.
>
> public class MyAgg extends BasicAggregator<Text> {
> ...
> }
>
> This works fine when running on my laptop with one worker.
> However, when running it on the cluster, sometimes it does not return the
> correctly aggregated value.
> It seems it is returning the locally aggregated value of one of the
> workers.
> It should instead have used my logic to decide which of the aggregated
> values sent by the various workers is chosen as the final aggregated value.
> (But in fact I have not written such code anywhere, so it is doing the
> best it could.)
>
> Following is my analysis of this issue.
> a. I guess every worker aggregates the values locally.
> b. Then there is a global aggregation step, which simply compares the
> values sent by the various aggregators.
> c. For global aggregation it uses the Text.compareTo() method. This method
> is a default Hadoop implementation and does not include the logic of my
> program.
> d. It seems that, because of the above, the value returned by my aggregator
> on the cluster is actually not globally aggregated; the locally aggregated
> value of one of the workers gets taken instead.
>
> If the above analysis is correct, following is how I think I can solve
> this: I should write my own class that implements the Writable interface.
> In this class I would also write a compareTo method, and as a result things
> should start working fine.
>
> If it used the MyAgg class itself to decide which of the values returned
> by the various workers should be taken as the globally aggregated value,
> this problem would not have occurred.
>
> *I seek your guidance whether my analysis is correct.*
>
> - Puneet
> IIT Delhi, India
>
>


Re: How to ensure that only one worker runs per node

2014-10-30 Thread Matthew Saltz
Thanks Matthew. I just realized in searching a bit more that the variable
has to be set in the mapred-site.xml configuration file with hadoop
0.20.203, otherwise it doesn't affect anything. I was setting it in the
command line. I've already been setting the number of workers the way you
suggested. I'll give it a try.
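
For reference, this is how I believe the property looks in mapred-site.xml
(a sketch for hadoop 0.20.x):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>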

Best,
Matthew

On Thu, Oct 30, 2014 at 12:36 PM, Matthew Cornell 
wrote:

> As I understand it, 1) set the variable to 1 as you say, and 2)
> set the number of workers to the number of nodes minus 1 (leaving one
> for the master). When you run a job you can look at the 'map' link on the
> tasktracker UI to see all the workers plus the master.
>
> On Thu, Oct 30, 2014 at 7:11 AM, Matthew Saltz  wrote:
> > Hi everyone,
> >
> > Is there a good way (a configuration I'm guessing) to prevent more than
> one
> > worker from running per node? I saw in this thread to use
> > mapred.tasktracker.map.tasks.maximum=1, but that doesn't seem to be
> working.
> > Thanks for the help.
> >
> > Best,
> > Matthew
> >
>
>
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
> Dickinson Street, Amherst MA 01002 | matthewcornell.org
>


How to ensure that only one worker runs per node

2014-10-30 Thread Matthew Saltz
Hi everyone,

Is there a good way (a configuration I'm guessing) to prevent more than one
worker from running per node? I saw in this thread to use
mapred.tasktracker.map.tasks.maximum=1, but that doesn't seem to be
working. Thanks for the help.

Best,
Matthew


Re: Resource Allocation Model Of Apache Giraph

2014-10-24 Thread Matthew Saltz
You may set giraph.userPartitionCount=<number of workers> and
giraph.maxPartitionsInMemory=1.
Like Avery said, though, since parallelism occurs at the partition level
(each thread processes a different partition), if you only have one
partition per worker you cannot take advantage of multithreading.
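
For example, as custom arguments to GiraphRunner (a sketch; $N_WORKERS
stands for whatever you pass to -w):

-ca giraph.userPartitionCount=$N_WORKERS \
-ca giraph.maxPartitionsInMemory=1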

Best,
Matthew

On Fri, Oct 24, 2014 at 3:53 AM, Zhang, David (Paypal Risk) <
pengzh...@ebay.com> wrote:

>  I don't think there is a good solution. You can try running a Java
> application that uses FileInputFormat.getSplits to get the number of
> splits, and set the number of Giraph workers to that number.
>
> Or run a simple map-reduce job using IdentityMapper to see how many
> mappers there are.
>
>
>
> Thanks,
>
> Zhang, David (Paypal Risk)
>
> *From:* Charith Wickramarachchi [mailto:charith.dhanus...@gmail.com]
> *Sent:* 24 October 2014 5:37
> *To:* user
> *Subject:* Re: Resource Allocation Model Of Apache Giraph
>
>
>
> Thanks Claudio and Avery,
>
>
>
> I found a way to configure hadoop to have the desired number of mappers
> per machine, as Claudio mentioned.
>
>
>
> Avery,
>
> Could you please tell me how I can configure giraph to make each worker
> handle only a single partition?
>
>
>
> Thanks,
> Charith
>
>
>
> On Thu, Oct 23, 2014 at 2:26 PM, Avery Ching  wrote:
>
> Regarding your second point, partitions are decoupled from workers.  A
> worker can handle zero or more partitions.  You can make each worker handle
> one partition, but we typically like multiple partitions since we can use
> multi-threading per machine.
>
>
>
> On 10/23/14, 9:04 AM, Claudio Martella wrote:
>
>  the way mappers (or containers) and hence workers are assigned to
> machines is not under the control of giraph, but of the underlying hadoop
> environment (with different responsibilities that depend on the hadoop
> version, e.g. YARN). You'll have to tweak your hadoop configuration to
> control the maximum number of workers assigned to one machine (optimally
> one with multiple threads).
>
>
>
> On Thu, Oct 23, 2014 at 5:53 PM, Charith Wickramarachchi <
> charith.dhanus...@gmail.com> wrote:
>
> Hi Folks,
>
>
>
> I'm wondering what is the resource allocation model for Apache Giraph.
>
>
>
> As I understand it, each worker is mapped one-to-one with a Mapper, and a
> worker can process multiple partitions with a user-defined number of
> threads.
>
>
>
> Is it possible to make sure that one worker only processes a single
> partition? Also, is it possible to control the worker assignment in the
> cluster nodes? (E.g., make sure only N workers run on a single machine,
> assuming we have enough resources.)
>
>
>
> Thanks,
>
> Charith
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
>
> Charith Dhanushka Wickramaarachchi
>
>
>
> Tel  +1 213 447 4253
>
> Web  http://apache.org/~charith 
>
> Blog  http://charith.wickramaarachchi.org/
> 
>
> Twitter  @charithwiki 
>
>
>
>
>
>
>
>
> --
>
>Claudio Martella
>
>
>
>
>
>
>
>
> --
>
> Charith Dhanushka Wickramaarachchi
>
>
>
> Tel  +1 213 447 4253
>
> Web  http://apache.org/~charith 
>
> Blog  http://charith.wickramaarachchi.org/
> 
>
> Twitter  @charithwiki 
>
>
>
>


Removing edges with ByteArrayPartition not working

2014-10-23 Thread Matthew Saltz
Hi everyone,

I just switched from using SimplePartition to using ByteArrayPartition to
save memory, but now my code does not work as expected. I narrowed the
problem down to edge removal. I'm not sure exactly why but calling
vertex.removeEdge(id) doesn't behave as it does with SimplePartition. It
doesn't throw an exception or anything, it just doesn't seem to correctly
remove edges. Does anyone know if this might be a bug from the 1.1.0
release? I couldn't find anything and I'm not sure if this should be going
out to the developer or user listserv, so forgive me.

Best,
Matthew


Excessive Memory Usage Compared to Graph Size

2014-10-23 Thread Matthew Saltz
Hi everyone,

I'm working on a community detection algorithm for giraph and I'm trying to
execute the algorithm on the Friendster graph, which has about 65M nodes
and about 1.8 billion edges. Running on 16 machines, before doing ANY
processing, it's taking about 50G of RAM. That's 800G total for this graph,
which seems excessive. I'm using the 1.1.0 stable release and Hadoop
0.20.203 (should I use a newer version of hadoop?).  This is the command
I'm running:

$HADOOP_HOME/bin/hadoop --config $CONF jar $GIR_JAR
org.apache.giraph.GiraphRunner -D 'mapred.child.java.opts=-Xms80G -Xmx80G'
-D 'mapred.tasktracker.map.tasks.maximum=1' -D-Xmx6m -libjars $LIBJARS
computation.StartComputation $GIRAPH_OPTIONS -eif
org.apache.giraph.io.formats.IntNullReverseTextEdgeInputFormat -eip $INPUT
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op $OUTPUT
-w $N_WORKERS -mc computation.WccMasterCompute

where N_THREADS=12 (number of available threads on the machine),
N_PARTITIONS=3*N_THREADS*N_WORKERS, and

GIRAPH_OPTIONS="-ca giraph.useSuperstepCounters=false
-ca giraph.numComputeThreads=$N_THREADS
-ca giraph.numInputThreads=$N_THREADS
-ca giraph.numOutputThreads=$N_THREADS
-ca giraph.oneToAllMsgSending=true
-ca giraph.metrics.enable=true
-ca giraph.maxPartitionsInMemory=$N_THREADS
-ca giraph.userPartitionCount=$N_PARTITIONS
-ca giraph.outEdgesClass=utils.IntNullHashSetEdges"

IntNullHashSetEdges takes a bit more memory than IntNullArrayEdges, I know,
but it doesn't make that big of a difference.

Each vertex contains 7 ints, a double, three arrays and a map, but all of
these are empty when the graph is loaded and it still takes that much
memory.

I feel like I must be doing something wrong, or missing a configuration
option, or something. Thanks in advance for any help you might be able
to offer.

Best,
Matthew


Re: Multiple sendMessage calls vs. sendMessageToMultipleEdges

2014-10-22 Thread Matthew Saltz
Actually, one more question: are there any disadvantages to enabling
oneToAllMessaging? Is there any reason not to do it by default?

Best,
Matthew
On 22/10/2014 23:28, "Matthew Saltz"  wrote:

> Lukas,
>
> Thank you so much for the help. By 'the first class', you mean 
> SendMessageToAllCache
> is not used unless I set the property to true, right? Because I actually do
> have giraph.oneToAllMsgSending=true, so if that means it's using
> SendMessageToAllCache  then everything makes much more sense. So I guess
> it makes sense then that case (b) that I mentioned that would be much
> faster than case (a)? I really appreciate it.  And do you have any ideas
> about the second question I asked? I think the answer is no but I'm kind of
> hoping it's not.
>
> Best,
> Matthew
>
>
>
> On Wed, Oct 22, 2014 at 11:16 PM, Lukas Nalezenec <
> lukas.naleze...@firma.seznam.cz> wrote:
>
>>  Hi Matthew,
>>
>> See class SendMessageToAllCache. Its in the same directory as
>> SendMessageCache. The first class is not used by Giraph unless you set
>> property giraph.oneToAllMsgSending to true.
>>
>> Lukas
>>
>>
>> On 22.10.2014 20:10, Matthew Saltz wrote:
>>
>> Hi everyone,
>>
>> I have two questions:
>>
>>  *Question 1)* I'm using release 1.1.0 and I'm really confused about the
>> fact that I'm having massive performance differences in the following
>> scenario. I need to send one message from each vertex to a subset of its
>> neighbors (all that satisfy a certain condition). For that, I see two basic
>> options:
>>
>> a) Loop over all edges, making a call to sendMessage(source, target)
>> whenever target satisfies a condition I want, reusing the same IntWritable
>> for the target vertex by calling target.set(_)
>>b) Loop over all edges, building up an ArrayList (or whatever) of
>> targets that satisfy the condition, and calling
>> sendMessageToMultipleEdges(targets) at the end.
>>
>>  Surprisingly, I get much, much worse performance using option (a),
>> which I would think would be much faster. So I looked in the code and
>> eventually found my way to SendMessageCache
>> <https://github.com/apache/giraph/blob/release-1.1/giraph-core/src/main/java/org/apache/giraph/comm/SendMessageCache.java>,
>> where it turns out that sendMessageToMultipleEdges ->
>> sendMessageToAllRequest(Iterator, Message) actually just loops over the
>> iterator, repeatedly calling sendMessageRequest (which is what I thought I
>> was doing in scenario (a)). I might have incorrectly traced the code though.
>> Can anyone tell me what might be going on? I'm really puzzled by this.
>>
>>  *Question 2) *Is there a good way of sending a vertex's adjacency list
>> to its neighbors, without building up your own copy of an adjacency list
>> and then sending that? I'm going through the Edge iterable and building an
>> ArrayPrimitiveWritable of ids but it would be nice if I could somehow
>> access the underlying data structure behind the iterable or just wrap the
>> iterable as a writable somehow.
>>
>>  Thanks so much for the help,
>> Matthew Saltz
>>
>>
>>
>>
>>
>


Re: Multiple sendMessage calls vs. sendMessageToMultipleEdges

2014-10-22 Thread Matthew Saltz
Lukas,

Thank you so much for the help. By 'the first class', you mean
SendMessageToAllCache is not used unless I set the property to true, right?
Because I actually do have giraph.oneToAllMsgSending=true, so if that means
it's using SendMessageToAllCache then everything makes much more sense. I
guess that explains why case (b) that I mentioned would be much faster than
case (a)? I really appreciate it. And do you have any ideas about the second
question I asked? I think the answer is no, but I'm kind of hoping it's not.

Best,
Matthew



On Wed, Oct 22, 2014 at 11:16 PM, Lukas Nalezenec <
lukas.naleze...@firma.seznam.cz> wrote:

>  Hi Matthew,
>
> See class SendMessageToAllCache. Its in the same directory as
> SendMessageCache. The first class is not used by Giraph unless you set
> property giraph.oneToAllMsgSending to true.
>
> Lukas
>
>
> On 22.10.2014 20:10, Matthew Saltz wrote:
>
> Hi everyone,
>
> I have two questions:
>
>  *Question 1)* I'm using release 1.1.0 and I'm really confused about the
> fact that I'm having massive performance differences in the following
> scenario. I need to send one message from each vertex to a subset of its
> neighbors (all that satisfy a certain condition). For that, I see two basic
> options:
>
> a) Loop over all edges, making a call to sendMessage(source, target)
> whenever target satisfies a condition I want, reusing the same IntWritable
> for the target vertex by calling target.set(_)
>b) Loop over all edges, building up an ArrayList (or whatever) of
> targets that satisfy the condition, and calling
> sendMessageToMultipleEdges(targets) at the end.
>
>  Surprisingly, I get much, much worse performance using option (a), which
> I would think would be much faster. So I looked in the code and eventually
> found my way to SendMessageCache
> <https://github.com/apache/giraph/blob/release-1.1/giraph-core/src/main/java/org/apache/giraph/comm/SendMessageCache.java>,
> where it turns out that sendMessageToMultipleEdges ->
> sendMessageToAllRequest(Iterator, Message) actually just loops over the
> iterator, repeatedly calling sendMessageRequest (which is what I thought I
> was doing in scenario (a)). I might have incorrectly traced the code though.
> Can anyone tell me what might be going on? I'm really puzzled by this.
>
>  *Question 2) *Is there a good way of sending a vertex's adjacency list
> to its neighbors, without building up your own copy of an adjacency list
> and then sending that? I'm going through the Edge iterable and building an
> ArrayPrimitiveWritable of ids but it would be nice if I could somehow
> access the underlying data structure behind the iterable or just wrap the
> iterable as a writable somehow.
>
>  Thanks so much for the help,
> Matthew Saltz
>
>
>
>
>


Multiple sendMessage calls vs. sendMessageToMultipleEdges

2014-10-22 Thread Matthew Saltz
Hi everyone,

I have two questions:

*Question 1)* I'm using release 1.1.0 and I'm really confused about the
fact that I'm having massive performance differences in the following
scenario. I need to send one message from each vertex to a subset of its
neighbors (all that satisfy a certain condition). For that, I see two basic
options:

   a) Loop over all edges, making a call to sendMessage(source, target)
whenever target satisfies a condition I want, reusing the same IntWritable
for the target vertex by calling target.set(_)
   b) Loop over all edges, building up an ArrayList (or whatever) of
targets that satisfy the condition, and calling
sendMessageToMultipleEdges(targets) at the end.
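
In code, the two options look roughly like this (a sketch; condition() and
message are placeholders, and it assumes Int ids with Null edge values plus
the usual java.util and org.apache.giraph.edge.Edge imports):

// (a) one sendMessage call per qualifying neighbour, reusing the id object
IntWritable target = new IntWritable();
for (Edge<IntWritable, NullWritable> edge : vertex.getEdges()) {
  if (condition(edge)) {
    target.set(edge.getTargetVertexId().get());
    sendMessage(target, message);
  }
}

// (b) collect copies of the target ids, then make one bulk call
List<IntWritable> targets = new ArrayList<IntWritable>();
for (Edge<IntWritable, NullWritable> edge : vertex.getEdges()) {
  if (condition(edge)) {
    targets.add(new IntWritable(edge.getTargetVertexId().get()));
  }
}
sendMessageToMultipleEdges(targets.iterator(), message);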

Surprisingly, I get much, much worse performance using option (a), which I
would think would be much faster. So I looked in the code and eventually
found my way to SendMessageCache
<https://github.com/apache/giraph/blob/release-1.1/giraph-core/src/main/java/org/apache/giraph/comm/SendMessageCache.java>,
where it turns out that sendMessageToMultipleEdges ->
sendMessageToAllRequest(Iterator, Message) actually just loops over the
iterator, repeatedly calling sendMessageRequest (which is what I thought I
was doing in scenario (a)). I might have incorrectly traced the code though.
Can anyone tell me what might be going on? I'm really puzzled by this.

*Question 2) *Is there a good way of sending a vertex's adjacency list to
its neighbors, without building up your own copy of an adjacency list and
then sending that? I'm going through the Edge iterable and building an
ArrayPrimitiveWritable of ids but it would be nice if I could somehow
access the underlying data structure behind the iterable or just wrap the
iterable as a writable somehow.

Thanks so much for the help,
Matthew Saltz


Re: giraph.metrics.enable

2014-10-10 Thread Matthew Saltz
Use the -ca flag for each one. giraph.metrics.enable puts metrics from the
run in the stderr file of each worker by default; I've never used the option
that changes the directory. If that's not what you're looking for let me
know, I'll be glad to help.
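
For example (a sketch; everything except the two -ca options is a
placeholder):

hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner my.app.MyComputation \
  ... \
  -ca giraph.metrics.enable=true \
  -ca giraph.metrics.directory=/tmp/giraph-metrics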

Best,
Matthew
On 10/10/2014 19:55, "xuhong zhang"  wrote:

> Dear everyone,
>
> How to use  "giraph.metrics.enable=true" and
> "giraph.metrics.directory=dir" options.
>
> Where should i specify them in the command?
>
> thanks
>
> --
> Xuhong Zhang
>


Re: getAggregatedValue calling aggregate

2014-09-23 Thread Matthew Saltz
Hi Puneet,

What are you trying to do in getAggregatedValue()? Is there any reason you
don't just want to return the current value of the aggregator (which is
what the default implementation does)?

Best,
Matthew

On Sat, Sep 20, 2014 at 6:25 AM, Puneet Agarwal  wrote:

> I have created my own aggregator, by extending the BasicAggregator.
>
> In this aggregator I have overridden following methods
>
> aggregate()
> createInitialValue()
> getAggregatedValue()
>
> (The documentation says that we need to implement only the aggregate and
> createInitialValue methods; if so, what will be returned when
> getAggregatedValue() is called?)
>
> My main problems are:
>
> a. When I call the method getAggregatedValue() from the compute method of
> my vertex class, nothing gets returned
> b. When I call the method getAggregatedValue() from the compute method of
> my vertex class, aggregate() method of the aggregator class is getting
> called and it received the value returned by the method
> getAggregatedValue().
>
> am I missing something very basic here?
>
> Please Help
>
> Puneet Agarwal
> IIT Delhi
>


Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Saltz
Hi Matthew,

I answered a few of your questions in-line (unfortunately they might not
help the larger problem, but hopefully it'll help a bit).

Best,
Matthew


On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell 
wrote:

> Hi Folks,
>
> I've spent the last two months learning, installing, coding, and
> analyzing the performance of our Giraph app, and I'm able to run on
> small inputs on our tiny cluster (yay!) I am now stuck trying to
> figure out why larger inputs fail, why only some compute nodes are
> being used, and generally how to make sure I've configured hadoop and
> giraph to use all available CPUs and RAM. I feel that I'm "this
> close," and I could really use some pointers.
>
> Below I share our app, configuration, results and log messages, some
> questions, and counter output for the successful run. My post here is
> long (I've broken it into sections delimited with ''), but I hope
> I've provided good enough information to get help on. I'm happy to add
> to it.
>
> Thanks!
>
>
>  application 
>
> Our application is a kind of path search where all nodes have a type
> and source database ID (e.g., "movie 99"), and searches are expressed
> as type paths, such as "movie, acted_in, actor", which would start
> with movies and then find all actors in each movie, for all movies in
> the database. The program does a kind of filtering by keeping track of
> previously-processed initial IDs.
>
> Our database is a small movie one with 2K movies, 6K users (people who
> rate movies), and 80K ratings of movies by users. Though small, we've
> found this kind of search can result in a massive explosion of
> messages, as was well put by Rob Vesse (
>
> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
> ):
>
> > even with this relatively small graph you get a massive explosion of
> messages by the later super steps which exhausts memory (in my graph the
> theoretical maximum messages by the last super step was ~3 billion)
>
>
>  job failure and error messages 
>
> Currently I have a four-step path that completes in ~20 seconds
> ("rates, movie, rates, user" - counter output shown at bottom) but a
> five-step one ("rates, movie, rates, user, rates") fails after a few
> minutes. I've looked carefully at the task logs, but I find it a
> little difficult to discern what the actual failure was. However,
> looking at system information (e.g., top and ganglia) during the run
> indicates hosts are running out of memory. There are no
> OutOfMemoryErrors in the logs, and only this one stands out:
>
> > ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker
> Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>
> NB: So far I've been ignoring these other types of messages:
>
> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
> No last good checkpoints can be found, killing the job.
>
> > java.io.FileNotFoundException: File
> _bsp/_checkpoints/job_201409191450_0003 does not exist.
>
> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
> event
> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
> type=NodeDeleted, state=SyncConnected)
>
> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
> on superstep 4
>
> The counter statistics are minimal after the run fails, but during it
> I see something like this when refreshing the Job Tracker Web UI:
>
> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> > Counters > Giraph Stats > Sent messages: ~181M
>
>
>  hadoop/giraph command 
>
> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
> -Dgiraph.zkList=xx.xx.xx.edu:2181 \
> -libjars ${LIBJARS} \
> relpath.RelPathVertex \
> -wc relpath.RelPathWorkerContext \
> -mc relpath.RelPathMasterCompute \
> -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
> -vip $REL_PATH_INPUT \
> -of relpath.RelPathVertexValueTextOutputFormat \
> -op $REL_PATH_OUTPUT \
> -ca RelPathVertex.path=$REL_PATH_PATH \
> -w 8
>
>
>  cluster, versions, and configuration 
>
> We have a five-node cluster with a head and four compute nodes. The
> head has 2 CPUs, 16 cores each, and 64 GB RAM. Each compute has 1 CPU,
> 4 cores each, and 16 GB RAM, making a total cluster of 128 GB of RAM
> and 48 cores.
>
> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
> on the head node, and task trackers on all five nodes.
>
> Hadoop configuration (mapred-site.xml and CDH interface - sorry for
> the mix) - not sure I'm listing all of them of interest:
>
> > mapreduce.job.counters.max: 120
> > mapred.

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Saltz
Sorry, should be
"org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats()";
I left out the giraph.

On Mon, Sep 22, 2014 at 8:10 PM, Matthew Saltz  wrote:

> Hi Matthew,
>
> I answered a few of your questions in-line (unfortunately they might not
> help the larger problem, but hopefully it'll help a bit).
>
> Best,
> Matthew
>
>
> On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell 
> wrote:
>
>> Hi Folks,
>>
>> I've spent the last two months learning, installing, coding, and
>> analyzing the performance of our Giraph app, and I'm able to run on
>> small inputs on our tiny cluster (yay!) I am now stuck trying to
>> figure out why larger inputs fail, why only some compute nodes are
>> being used, and generally how to make sure I've configured hadoop and
>> giraph to use all available CPUs and RAM. I feel that I'm "this
>> close," and I could really use some pointers.
>>
>> Below I share our app, configuration, results and log messages, some
>> questions, and counter output for the successful run. My post here is
>> long (I've broken it into sections delimited with ''), but I hope
>> I've provided good enough information to get help on. I'm happy to add
>> to it.
>>
>> Thanks!
>>
>>
>>  application 
>>
>> Our application is a kind of path search where all nodes have a type
>> and source database ID (e.g., "movie 99"), and searches are expressed
>> as type paths, such as "movie, acted_in, actor", which would start
>> with movies and then find all actors in each movie, for all movies in
>> the database. The program does a kind of filtering by keeping track of
>> previously-processed initial IDs.
>>
>> Our database is a small movie one with 2K movies, 6K users (people who
>> rate movies), and 80K ratings of movies by users. Though small, we've
>> found this kind of search can result in a massive explosion of
>> messages, as was well put by Rob Vesse (
>>
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
>> ):
>>
>> > even with this relatively small graph you get a massive explosion of
>> messages by the later super steps which exhausts memory (in my graph the
>> theoretical maximum messages by the last super step was ~3 billion)
>>
>>
>>  job failure and error messages 
>>
>> Currently I have a four-step path that completes in ~20 seconds
>> ("rates, movie, rates, user" - counter output shown at bottom) but a
>> five-step one ("rates, movie, rates, user, rates") fails after a few
>> minutes. I've looked carefully at the task logs, but I find it a
>> little difficult to discern what the actual failure was. However,
>> looking at system information (e.g., top and ganglia) during the run
>> indicates hosts are running out of memory. There are no
>> OutOfMemoryErrors in the logs, and only this one stands out:
>>
>> > ERROR org.apache.giraph.master.BspServiceMaster:
>> superstepChosenWorkerAlive: Missing chosen worker
>> Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>>
>> NB: So far I've been ignoring these other types of messages:
>>
>> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
>> No last good checkpoints can be found, killing the job.
>>
>> > java.io.FileNotFoundException: File
>> _bsp/_checkpoints/job_201409191450_0003 does not exist.
>>
>> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>> type=NodeDeleted, state=SyncConnected)
>>
>> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
>> failure, unregistering health on
>> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
>> on superstep 4
>>
>> The counter statistics are minimal after the run fails, but during it
>> I see something like this when refreshing the Job Tracker Web UI:
>>
>> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot:
>> ~28GB
>> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
>> > Counters > Giraph Stats > Sent messages: ~181M
>>
>>
>>  hadoop/giraph command 
>>
>> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache

Re: how do I maintain a cached List across supersteps?

2014-09-16 Thread Matthew Saltz
Hey Matt,

If you need to share data between all vertices that persists across
supersteps and that's created/determined at runtime, I believe an
aggregator is the best way to do this. Then you can declare an instance
variable within the Computation class and within the preSuperstep method of
the class you can use getAggregatedValue to set the value of the variable.
Unless you can afford to just reparse the argument within the Computation
class at each superstep, in which case you can use getConf() and give that
a try.
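
A minimal, untested sketch of the aggregator approach (all names are
placeholders, and I'm assuming the shared value parses to a long):

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

public class MyMasterCompute extends DefaultMasterCompute {
  public static final String SHARED_ARG = "my.shared.arg";

  @Override
  public void initialize() throws InstantiationException,
      IllegalAccessException {
    registerPersistentAggregator(SHARED_ARG, LongSumAggregator.class);
  }

  @Override
  public void compute() {
    if (getSuperstep() == 0) {
      // Parse the custom argument once and publish it to all workers.
      long parsed = Long.parseLong(getConf().get("my.customArgument", "0"));
      setAggregatedValue(SHARED_ARG, new LongWritable(parsed));
    }
  }
}

Then in the Computation subclass, cache it once per superstep:

private long sharedValue;

@Override
public void preSuperstep() {
  sharedValue =
      this.<LongWritable>getAggregatedValue(MyMasterCompute.SHARED_ARG).get();
}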

Best,
Matthew
On 16/09/2014 21:43, "Matthew Cornell"  wrote:

> Hi Folks. I have a custom argument that's passed into my Giraph job that
> needs parsing. The parsed value is accessed by my Vertex#compute. To avoid
> excessive GC I'd like to cache the parsing results. What's a good way to do
> so? I looked at using the ImmutableClassesGiraphConfiguration returned by
> getConf(), but it supports only String properties. I looked at using my
> custom MasterCompute to manage it, but I couldn't find how to access the
> master compute instance from the vertex. My last idea is to use (abuse?) an
> aggregator to do this. I'd appreciate your thoughts! -- matt
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson
> Street, Amherst MA 01002 | matthewcornell.org
>


Re: Dynamic graphs vs Graph Mutation

2014-09-15 Thread Matthew Saltz
Khaled,

1) What exact functionality are you looking for? There isn't native Giraph
support per se for time-evolving graphs in that sense (as far as I'm aware)
but you should be able to simulate it in some fashion.

2) Check out AbstractComputation. When you want to call methods for
adding/removing edges/vertices other than on the Vertex you're calling from,
you'll be calling them from there. If you look at the actual code, that
leads you into the comm package with WorkerClientRequestProcessor, and into
NettyWorkerClientRequestProcessor, where things are actually implemented.
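
As an untested sketch of that request API, from inside compute() in an
AbstractComputation subclass (the ids and Long/Null types are placeholders;
each of these methods throws IOException):

// Requested mutations are buffered and applied between supersteps.
addVertexRequest(new LongWritable(42));
addEdgeRequest(vertex.getId(),
    EdgeFactory.create(new LongWritable(42), NullWritable.get()));
removeVertexRequest(new LongWritable(7));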

Hope that helps,
Matthew



On Mon, Sep 15, 2014 at 7:29 PM, Khaled Ammar 
wrote:

> Hi all,
>
> I understand that Pregel supports graph mutations when vertices issue
> requests to make changes in the graph topology (add/remove vertex/edge). I
> have two questions here:
>
> 1- Does the same concept applies to supporting dynamic graphs? (Graphs
> that change over time where the add/remove vertex/edge is read from a file
> instead of being generated by the vertex?)
> 2- Which class/function handles graph mutations in the source code?
>
> --
> Thanks,
> -Khaled
>


Re: Problem processing large graph

2014-09-11 Thread Matthew Saltz
Hi Tripti,

How many machines are you running on? The ideal configuration would be one
worker per machine and one separate machine for the master. If you're using
more mappers than machines then you're using more resources than necessary,
and fixing that could help.

Best,
Matthew
On 11/09/2014 13:39, "Tripti Singh"  wrote:

>  Hi Avery,
> Thanks for your reply.
> I did adjust the heap and container size to higher values (3072MB and
> 4096MB respectively), and I am not running the out-of-core option either.
> I am intermittently able to run the job with 200 mappers. At other times,
> I can run part of the data while the other part gets stalled.
> FYI, I am using netty without authentication.
> One thing that I have noticed, however, is that it mostly runs
> successfully when the queue is running almost idle or is comparatively
> free. On most instances when the queue is running more tasks or is
> over-allocated, my job stalls even when the required number of containers
> are allocated. Looking at the logs, I mostly find it stalling at Superstep
> 0 or 1 after finishing Superstep –1. Or sometimes, even at –1.
> Could there be some shared resources in the queue which are not enough for
> the job while the job runs on a loaded queue and can I configure some other
> value to make it run?
>
>  Tripti Singh
> Tech Yahoo, Software Sys Dev Eng
> P: +91 080.30516197  M: +91 9611907150
> Yahoo Software Development India Pvt. Ltd Torrey Pines Bangalore 560 071
>
>
>   From: Avery Ching 
> Reply-To: "user@giraph.apache.org" 
> Date: Wednesday, September 3, 2014 at 10:53 PM
> To: "user@giraph.apache.org" 
> Subject: Re: Problem processing large graph
>
>   Hi Tripti,
>
> Is there a chance you can use higher memory machines so you don't run out
> of core?  We do it this way at Facebook.  We've haven't tested the
> out-of-core option.
>
> Avery
>
> On 8/31/14, 2:34 PM, Tripti Singh wrote:
>
>  Hi,
>  I am able to successfully build hadoop_yarn profile for running Giraph
> 1.1.
>  I am also able to test run Connected Components on a small dataset.
>  However, I am seeing 2 issues while running on a bigger dataset with 400
> mappers:
>
>1. I am unable to use out of Core Graph option. It errors out saying
>that it cannot read INIT partition. (Sorry I don’t have the log currently
>but I will share after I run that again).
>I am expecting that if the out of Core option is fixed, I should be
>able to run the workflow with less mappers.
>2. In order to run the workflow anyhow, I removed the out of Core
>option and adjusted the heap size. This also runs with smaller dataset but
>fails with huge dataset.
>Worker logs are mostly empty. Non-empty logs end like this:
>
>
>
>
>
>
>
>
>
>
>    mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
>    [STATUS: task-374] setup: Beginning worker setup.
>    setup: Log level remains at info
>    [STATUS: task-374] setup: Initializing Zookeeper services.
>    mapred.job.id is deprecated. Instead, use mapreduce.job.id
>    job.local.dir is deprecated. Instead, use mapreduce.job.local.dir
>    [STATUS: task-374] setup: Setting up Zookeeper manager.
>    createCandidateStamp: Made the directory
>    _bsp/_defaultZkManagerDir/giraph_yarn_application_1407992474095_708614
>    createCandidateStamp: Made the directory
>    _bsp/_defaultZkManagerDir/giraph_yarn_application_1407992474095_708614/_zkServer
>    createCandidateStamp: Creating my filestamp
>    _bsp/_defaultZkManagerDir/giraph_yarn_application_1407992474095_708614/_task/gsta33201.tan.ygrid.yahoo.com 374
>    getZooKeeperServerList: For task 374, got file 'null' (polling period is 3000)
>
>Master log has log statements for launching the container, opening
>proxy and processing event like this:
>
>
>    Opening proxy : gsta31118.tan.ygrid.yahoo.com:8041
>    Processing Event EventType: QUERY_CONTAINER for Container container_1407992474095_708614_01_000314 ……
>
> I am not using SASL authentication.
> Any idea what might be wrong?
>
>  Thanks,
> Tripti.
>
>
>
>


Re: How do I validate customArguments?

2014-09-10 Thread Matthew Saltz
No worries. Just by the way, I realized after I sent that that using the
"public static int numPreprocessingSteps" to store the value in the
MasterCompute class doesn't work; you need to register a permanent
aggregator to hold on to it, if you need to.

Best,
Matthew

On Wed, Sep 10, 2014 at 3:15 PM, Matthew Cornell 
wrote:

> Sorry for the long delay, Matthew. That's really helpful. Right now I'm
> stuck on apparently running out of memory on our little cluster, but the
> log messages are confusing. I'm putting together a question, but in the
> meantime I'll try one of the simpler examples such as degree count to see
> if /anything/ will run against my graph, which is very small (100K nodes
> and edges). -- matt
>
> On Thu, Aug 28, 2014 at 2:26 PM, Matthew Saltz  wrote:
>
>> Matt,
>>
>> I'm not sure if you've resolved this problem already or not, but if you
>> haven't: The initialize() method isn't limited to registering aggregators,
>> and in fact, in my project I use it to do exactly what you're describing to
>> check and load custom configuration parameters. Inside the initialize()
>> method, I do this:
>>
>> String numPreprocessingStepsConf =
>>     getConf().get(NUMBER_OF_PREPROCESSING_STEPS_CONF_OPT);
>> numPreprocessingSteps = (numPreprocessingStepsConf != null) ?
>>     Integer.parseInt(numPreprocessingStepsConf.trim()) :
>>     DEFAULT_NUMBER_OF_PREPROCESSING_STEPS;
>> System.out.println("Number of preprocessing steps: " +
>>     numPreprocessingSteps);
>>
>> where at the class level I declare:
>>
>>   public static final String NUMBER_OF_PREPROCESSING_STEPS_CONF_OPT =
>> "wcc.numPreprocessingSteps";
>>   public static final int DEFAULT_NUMBER_OF_PREPROCESSING_STEPS = 1;
>>   public static int numPreprocessingSteps;
>>
>> To set the property, I use the option "-ca
>> wcc.numPreprocessingSteps=<number>". If you need to check
>> that it's properly formatted and not store them, this is a fine place to do
>> it as well, given that it's run before the input superstep (see the giraph
>> code in BspServiceMaster, line 1617 in the stable 1.1.0 release). What
>> happens is that on the master, the MasterThread calls coordinateSuperstep()
>> on a BspServiceMaster object, which checks if it's the input superstep, and
>> if so, calls initialize() on the MasterCompute object (created in the
>> becomeMaster() method of BspServiceMaster).
>>
>> Hope this helps,
>> Matthew
>>
>>
>>
>> On Tue, Aug 26, 2014 at 4:36 PM, Matthew Cornell > > wrote:
>>
>>> Hi again. My application needs to pass in a String argument to the
>>> computation which each Vertex needs access to. (The argument is a list of
>>> the form "[item1, item2, ...]".) I found --customArguments (which I set in
>>> my tests via conf.set(<name>, <value>)) but I need to check that it's
>>> properly formatted. Where do I do that? The only thing I thought of is to
>>> specify a DefaultMasterCompute subclass whose initialize() does the check,
>>> but all the initialize() examples do is register aggregators; none of them
>>> check args or do anything else. Thanks in advance! -- matt
>>>
>>> --
>>> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson
>>> Street, Amherst MA 01002 | matthewcornell.org
>>>
>>
>>
>
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson
> Street, Amherst MA 01002 | matthewcornell.org
>


Re: Which is better: sending many small messages or fewer large ones?

2014-09-04 Thread Matthew Saltz
Hi Matthew,

Starting with your P.S.: It's not nutty; see MapWritable
<https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/io/MapWritable.html>
for
example, which can be used as a message type, or ArrayPrimitiveWritable
<http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/io/ArrayPrimitiveWritable.html>.
In this project <https://github.com/grafos-ml/okapi>, which I've found
helpful in getting inspiration for things as I'm getting started, they use
collections for messages in multiple places.

Going back to your main question: when you say many small vs. fewer large
messages, I guess you mean that they'd both be sent in the same superstep?
If that's the case, I'd recommend just testing it, since it's difficult to
say. My thought is that if you go with the large-message approach, you
could wrap the set in a primitive-array wrapper like ArrayPrimitiveWritable
and save a bit on the amount of data you're sending out, compared to
sending a bunch of small messages as LongWritables or whatever it might be.
If I remember correctly, I tried both approaches in the project I'm working
on, and the large-message approach was more effective. There's also the
option (if you run into memory problems, for example) of using large
messages but splitting the one superstep into multiple supersteps, if
that's feasible. In the end I've found that it's difficult to predict how
it will perform, and it never hurts to try both approaches and compare the
results.
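
To make that concrete, here's a rough, untested sketch of the batched
approach, assuming the computation's message type is ArrayPrimitiveWritable.
The LongWritable vertex value, NullWritable edge type, and placeholder IDs
are illustrative, not from your setup:

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.ArrayPrimitiveWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Sends one batched message per neighbor instead of one message per ID.
public class BatchedIdComputation extends BasicComputation<LongWritable,
        LongWritable, NullWritable, ArrayPrimitiveWritable> {

    @Override
    public void compute(
            Vertex<LongWritable, LongWritable, NullWritable> vertex,
            Iterable<ArrayPrimitiveWritable> messages) {
        // Unpack each incoming batch of IDs.
        for (ArrayPrimitiveWritable message : messages) {
            long[] ids = (long[]) message.get();
            // ... filter 'ids' against the vertex's processed list ...
        }
        // Pack the surviving 'new-to-me' IDs into one outgoing message.
        long[] newToMe = {1L, 2L, 3L};  // placeholder data
        sendMessageToAllEdges(vertex, new ArrayPrimitiveWritable(newToMe));
        vertex.voteToHalt();
    }
}

A nice property of ArrayPrimitiveWritable is that it serializes the raw
long[] directly, so you avoid the per-object overhead of thousands of
individual LongWritables on the wire.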

Everyone else, please correct me if I've said something incorrectly, as I'm
still relatively new at this.

Best,
Matthew Saltz



On Thu, Sep 4, 2014 at 8:16 PM, Matthew Cornell 
wrote:

> Hi Everyone,
>
> I have an app whose messaging granularity could be written two ways -
> sending many small messages vs. (possibly far) fewer larger ones.
> Conceptually what moves around is a set of 'alive' vertex IDs that might
> get filtered at each superstep based on a processed list (vertex value)
> that vertices manage. The ones that survive to the end are the lucky
> winners. compute() calculates a set of 'new-to-me' incoming IDs that are
> perfect for the outgoing message, but I could easily send each ID one at a
> time. My guess is that sending fewer messages is more important, but each
> set might contain thousands of IDs.
>
> Thanks!
>
> P.S. A side question: The few custom message type examples I've found are
> relatively simple objects with a few primitive instance variables, rather
> than collections. Is it nutty to send around a collection of IDs as a
> message?
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson
> Street, Amherst MA 01002 | matthewcornell.org
>


Re: How do I validate customArguments?

2014-08-28 Thread Matthew Saltz
Matt,

I'm not sure if you've resolved this problem already or not, but if you
haven't: The initialize() method isn't limited to registering aggregators,
and in fact, in my project I use it to do exactly what you're describing to
check and load custom configuration parameters. Inside the initialize()
method, I do this:

String numPreprocessingStepsConf =
    getConf().get(NUMBER_OF_PREPROCESSING_STEPS_CONF_OPT);
numPreprocessingSteps = (numPreprocessingStepsConf != null) ?
    Integer.parseInt(numPreprocessingStepsConf.trim()) :
    DEFAULT_NUMBER_OF_PREPROCESSING_STEPS;
System.out.println("Number of preprocessing steps: "
    + numPreprocessingSteps);

where at the class level I declare:

  public static final String NUMBER_OF_PREPROCESSING_STEPS_CONF_OPT =
"wcc.numPreprocessingSteps";
  public static final int DEFAULT_NUMBER_OF_PREPROCESSING_STEPS = 1;
  public static int numPreprocessingSteps;

To set the property, I use the option "-ca
wcc.numPreprocessingSteps=". If you just need to check
that an argument is properly formatted, without storing it, this is a fine place to do
it as well, given that it's run before the input superstep (see the giraph
code in BspServiceMaster, line 1617 in the stable 1.1.0 release). What
happens is that on the master, the MasterThread calls coordinateSuperstep()
on a BspServiceMaster object, which checks if it's the input superstep, and
if so, calls initialize() on the MasterCompute object (created in the
becomeMaster() method of BspServiceMaster).
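
For your original case of a list-formatted argument, here's a minimal
sketch of that kind of check; the option name "myapp.items" and the exact
validation are hypothetical, only the "[item1, item2, ...]" shape comes
from your question:

import org.apache.giraph.master.DefaultMasterCompute;

public class ValidatingMasterCompute extends DefaultMasterCompute {
    // Hypothetical option key, set with -ca myapp.items="[item1, item2]"
    public static final String ITEMS_CONF_OPT = "myapp.items";

    @Override
    public void initialize() throws InstantiationException,
            IllegalAccessException {
        String items = getConf().get(ITEMS_CONF_OPT);
        if (items == null) {
            throw new IllegalStateException("Missing -ca " + ITEMS_CONF_OPT);
        }
        String trimmed = items.trim();
        if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            throw new IllegalStateException("Expected " + ITEMS_CONF_OPT
                    + " of the form [item1, item2, ...] but got: " + items);
        }
    }
}

This way a malformed value fails the job before the input superstep instead
of partway through the computation.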

Hope this helps,
Matthew



On Tue, Aug 26, 2014 at 4:36 PM, Matthew Cornell 
wrote:

> Hi again. My application needs to pass in a String argument to the
> computation which each Vertex needs access to. (The argument is a list of
> the form "[item1, item2, ...]".) I found --customArguments (which I set in
> my tests via conf.set(, )) but I need to check that it's
> properly formatted. Where do I do that? The only thing I thought of is to
> specify a DefaultMasterCompute subclass whose initialize() does the check,
> but all the initialize() examples do is register aggregators; none of them
> check args or do anything else. Thanks in advance! -- matt
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson
> Street, Amherst MA 01002 | matthewcornell.org
>


Re: Setting variable value in Compute class and using it in the next superstep

2014-07-21 Thread Matthew Saltz
Yeah, that's true. Sorry I forgot that part. Luckily, it isn't too tricky
either, depending on the input format of your graph. Here's another example
<https://gist.github.com/saltzm/ab7172c57dec927061be> to get you started,
for a very simple input format for edges with no values. I basically took
the code straight from here
<http://giraph.apache.org/apidocs/org/apache/giraph/io/formats/LongLongNullTextInputFormat.html>
and modified it where I needed to, so that it returns the InputFormat I needed for
my code. You'll probably be better off digging through some of the already
implemented InputFormat classes that come with Giraph to do something
similar, since I'm guessing your input files will be different than mine.
Take a look at the subclasses of TextVertexInputFormat
<http://giraph.apache.org/apidocs/org/apache/giraph/io/formats/TextVertexInputFormat.html>,
since they deal with a lot of common input format styles, and see if you
can modify their code to work with your custom vertex data format. Now, the
example I give you is also easy because I just use the default constructor
of the class, but if you need to load additional data from the file into
your vertex data and the default constructor isn't appropriate, you may
have to do some extra parsing and legwork for that.
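
To sketch what I mean (untested; it follows the shape of
LongLongNullTextInputFormat, and assumes whitespace-separated lines of the
form "vertexId neighborId neighborId ..." plus the SampleVertexData class
from my earlier example):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SampleVertexDataInputFormat extends
        TextVertexInputFormat<LongWritable, SampleVertexData, NullWritable> {

    @Override
    public TextVertexReader createVertexReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        return new SampleVertexDataReader();
    }

    public class SampleVertexDataReader extends
            TextVertexReaderFromEachLineProcessed<String[]> {

        @Override
        protected String[] preprocessLine(Text line) throws IOException {
            return line.toString().split("\\s+");
        }

        @Override
        protected LongWritable getId(String[] tokens) throws IOException {
            return new LongWritable(Long.parseLong(tokens[0]));
        }

        @Override
        protected SampleVertexData getValue(String[] tokens)
                throws IOException {
            // Default-constructed value; parse extra fields here if your
            // input lines carry initial vertex data.
            return new SampleVertexData();
        }

        @Override
        protected Iterable<Edge<LongWritable, NullWritable>> getEdges(
                String[] tokens) throws IOException {
            List<Edge<LongWritable, NullWritable>> edges =
                new ArrayList<Edge<LongWritable, NullWritable>>(
                    tokens.length - 1);
            for (int i = 1; i < tokens.length; i++) {
                edges.add(EdgeFactory.create(
                    new LongWritable(Long.parseLong(tokens[i]))));
            }
            return edges;
        }
    }
}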

Best of luck,
Matthew



On Tue, Jul 22, 2014 at 12:28 AM, Sardeshmukh, Vivek <
vivek-sardeshm...@uiowa.edu> wrote:

>  Thank you Matthew for the example link. It is helpful. I'll give it a
> shot.
>
>
>  If I have a custom vertex class, isn't it necessary to change the
> VertexInputFormat class too? This class "loads" the data into the
> vertex, so if the vertex has a custom value field, the default input
> format doesn't know how to load it. Am I right?
>
>
>
> Vivek
>   --
> *From:* Schweiger, Tom 
> *Sent:* Monday, July 21, 2014 5:16 PM
> *To:* user@giraph.apache.org
> *Subject:* RE: Setting variable value in Compute class and using it in
> the next superstep
>
>
> For more than one flag, a custom class is necessary (unless you're able
> to, say, toggle the sign bit to get double usage out of a value).
>
> I've started a private thread with Vivek to get a better understanding of
> what he was trying to solve.
>
> And you are also correct that there isn't much to writing a custom vertex
> class.  The key is making sure you read and write in the same order.
> Likewise, extending a vertex reader can be quite simple.
>
>  --
> *From:* Matthew Saltz [sal...@gmail.com]
> *Sent:* Monday, July 21, 2014 3:09 PM
> *To:* user@giraph.apache.org
> *Subject:* Re: Setting variable value in Compute class and using it in
> the next superstep
>
>   Tom,
>
>  If it's necessary to store more than one flag though, for example, won't
> a custom class be necessary? I'm a beginner too, so I apologize if I'm
> incorrect about that. Just to be clarify, to keep persistent data for a
> vertex from one superstep to the next, it is necessary to encapsulate it in
> the type used for the 'V', right? In other words, if Vivek tries to use a
> normal member variable for the Computation class, it won't work will it?
>
> Also, just to point out, there actually isn't too much involved with
> writing your own custom vertex class. Here's a quick example
> <https://gist.github.com/saltzm/692fba1d3aade035ce9c> to get you started.
> Within your compute() method you can access the data in this class by doing
>
>  SampleVertexData d = vertex.getValue();
>
>  and then using d.setFlag(true) or boolean currentFlag = d.getFlag() for
> example.  And your computation class is now something like
>
>  public class MyComputation extends BasicComputation<IdType,
>          SampleVertexData, EdgeType, MessageType> {
>      @Override
>      public void compute(Vertex<IdType, SampleVertexData, EdgeType> vertex,
>              Iterable<MessageType> messages) { ... }
>
>      ...
>
>  }
>
>  As a warning, for this class I'm using Hadoop 0.20.203 and I'm also a
> beginner, so take everything I say with a grain of salt, and Tom please
> correct me if I'm wrong.
>
>  Best of luck,
> Matthew
>
>
> On Mon, Jul 21, 2014 at 11:37 PM, Schweiger, Tom 
> wrote:
>
>>  And in answer of :
>>
>>
>> This post also suggests (along with what I described above) to have a
>> field in the vertex value itself. For that I need to change the vertex
>> input format and also create my own custom vertex class. Is it really
>> necessary?
>>
>>  No, you don't need a custom vertex class or vertex input format. You can
>> create/initialize the value at the beginning of the first superstep.

Re: Setting variable value in Compute class and using it in the next superstep

2014-07-21 Thread Matthew Saltz
Tom,

If it's necessary to store more than one flag though, for example, won't a
custom class be necessary? I'm a beginner too, so I apologize if I'm
incorrect about that. Just to clarify, to keep persistent data for a
vertex from one superstep to the next, it is necessary to encapsulate it in
the type used for the 'V', right? In other words, if Vivek tries to use a
normal member variable for the Computation class, it won't work will it?

Also, just to point out, there actually isn't too much involved with
writing your own custom vertex class. Here's a quick example
<https://gist.github.com/saltzm/692fba1d3aade035ce9c> to get you started.
Within your compute() method you can access the data in this class by doing

SampleVertexData d = vertex.getValue();

and then using d.setFlag(true) or boolean currentFlag = d.getFlag() for
example.  And your computation class is now something like

public class MyComputation extends BasicComputation<IdType,
        SampleVertexData, EdgeType, MessageType> {
    @Override
    public void compute(Vertex<IdType, SampleVertexData, EdgeType> vertex,
            Iterable<MessageType> messages) { ... }

    ...

}
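
For reference, a minimal sketch of a vertex value class along those lines,
assuming a single boolean flag like in Vivek's case (field names are
illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class SampleVertexData implements Writable {
    private boolean flag = false;

    public boolean getFlag() {
        return flag;
    }

    public void setFlag(boolean flag) {
        this.flag = flag;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Keep the write order in sync with readFields().
        out.writeBoolean(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        flag = in.readBoolean();
    }
}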

As a warning, for this class I'm using Hadoop 0.20.203 and I'm also a
beginner, so take everything I say with a grain of salt, and Tom please
correct me if I'm wrong.

Best of luck,
Matthew


On Mon, Jul 21, 2014 at 11:37 PM, Schweiger, Tom 
wrote:

>  And in answer of :
>
>
> This post also suggests (along with what I described above) to have a
> field in the vertex value itself. For that I need to change the vertex
> input format and also create my own custom vertex class. Is it really
> necessary?
>
> No, you don't need a custom vertex class or vertex input format. You can
> create/initialize the value at the beginning of the first superstep.
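>
> For instance (a rough sketch, assuming a plain LongWritable vertex
> value), at the top of compute():
>
>     if (getSuperstep() == 0) {
>         // Overwrite whatever the input format put there.
>         vertex.setValue(new LongWritable(0));
>     }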
>
>  --
> *From:* Sardeshmukh, Vivek [vivek-sardeshm...@uiowa.edu]
> *Sent:* Monday, July 21, 2014 2:05 PM
> *To:* user@giraph.apache.org
> *Subject:* Setting variable value in Compute class and using it in the
> next superstep
>
>   Hi, all--
>
>
>  In my algorithm, I need to set a flag if certain conditions hold
> (locally at a vertex v). If this flag is set then execute some other block
> of code *only once*, and do nothing until some other condition holds.
>
>
>  My question is, can I declare a flag variable in the class where I
> override the compute function? I defined the flag as a public variable and
> set it once the conditions are met, but it seems the value is not
> "carried" over to the next superstep.
>
> I dug around a little bit in this mailing list and found this
>
> https://www.mail-archive.com/user@giraph.apache.org/msg01266.html
>
>
>  This post also suggests (along with what I described above) to have a
> field in the vertex value itself. For that I need to change the vertex
> input format and also create my own custom vertex class. Is it really
> necessary?
>
>
>  By the way, I am using Giraph 1.1.0 compiled against Hadoop 1.0.3. I was
> able to run SimpleShortestPathComputation successfully.
>
>
>  Here are more technical details of my algorithm: I am trying to
> implement Delta-stepping shortest path algorithm (
> http://dl.acm.org/citation.cfm?id=740136 or
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.2200 ). This
> was mentioned in Pregel paper. A vertex "relax" light edges if it belongs
> to the minimum bucket index (of course, aggregators!). Once a vertex is
> done with relaxing light edges it relaxes heavy edges (here is where I need
> a flag) once. A vertex may be "re-inserted" to a newer bucket and may have
> to execute all the steps that I described here again.
>
>
>  Thanks.
>
>
>  Sincerely,
>   Vivek
> A beginner in Giraph (and Java too!)
>
>


Re: Aggregator result type different from aggregate input type

2014-07-17 Thread Matthew Saltz
Also, I just thought of another possibility/question:

Is there any way to dynamically register aggregators? In other words,
instead of doing a Map, it would be ideal to just be able to register an
aggregator for each id, but on the fly, since I don't know what all the ids
will be in advance.

Thanks again for the help.

Matthew


On Thu, Jul 17, 2014 at 11:53 AM, Matthew Saltz  wrote:

> Hi everyone,
>
> I'm trying to implement my own aggregator, whose aggregated value should
> be a Map (for which I can use MapWritable) from an id (LongWritable) to a
> custom defined type (which simply extends Writable) that contains several
> aggregate metrics. I want vertices to be able to do something along the
> lines of
>
> aggregate(MY_MAP_AGGREGATOR, new MyAggregatorMessage(id, stat1, stat2));
>
> and then the map aggregator will do something like
>
> public void aggregate(MyAggregatorMessage m) {
>
>     MapWritable currentMap = (MapWritable) getAggregatedValue();
>
>     if (!currentMap.containsKey(m.getId())) {
>         // MyAggregatorData contains the aggregate info I want to keep for
>         // each id. Contains init. values for stat1 and stat2
>         currentMap.put(m.getId(), new MyAggregatorData());
>     }
>
>     // MapWritable stores Writables, so a cast is needed here
>     MyAggregatorData oldData = (MyAggregatorData) currentMap.get(m.getId());
>     // Performs appropriate aggregates for each stat and stores it. Sum,
>     // average, whatever
>     oldData.aggregate(m.getStat1(), m.getStat2());
> }
>
> However, the problem is that the method signatures
> <https://giraph.apache.org/apidocs/org/apache/giraph/aggregators/Aggregator.html>
> for Aggregator all have to use the same type. In other words, I can't have
>
> public MapWritable getAggregatedValue()
>
> and
>
> public void aggregate (MyAggregatorMessage m)
>
> because the types are different.
>
> My idea right now is to use a MyAggregatorWritable class that extends
> GenericWritable
> <http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/GenericWritable.html>
> to wrap both MyAggregatorMessage and MapWritable and then use that as the
> method signature for both, and deal with the rest through casting. I've
> already used GenericWritable for something else so the implementation would
> be straightforward.
>
> So, I have a few principal questions, I suppose:
>
> 1) Is there a better way to implement this than to use a GenericWritable
> as described above? If any of you have code for your own way to do this,
> I'd love to see it, and if not, I'd love to contribute what I come up with
> as a MapAggregator (in a generic manner) to the Giraph project if that
> would be appropriate.
>
> 2) Is there anything wrong in principle with this type of solution? In
> other words, is there some kind of philosophical or design reason that
> having a Map as an aggregator is a bad idea? I know that it might not end
> up being very efficient, but as it stands, I'm not seeing any other
> solution to my problem; if there's an ordinary kind of workaround that
> would be more efficient I'd love to hear it.
>
> 3) [Less important and more discussion oriented] Why is the API designed
> such that these methods must use the same type? It seems like having an
> Aggregator with separate input and output types would be useful.
>
> I apologize for the quite long message, and I appreciate any help you can
> offer. If you need any other information, please let me know and I'll be
> happy to provide it. In trying to simplify everything I easily could have
> made a mistake or left out something important.  Thanks in advance.
>
> Best,
> Matthew
> http://www.matthewsaltz.com
>
>


Aggregator result type different from aggregate input type

2014-07-17 Thread Matthew Saltz
Hi everyone,

I'm trying to implement my own aggregator, whose aggregated value should be
a Map (for which I can use MapWritable) from an id (LongWritable) to a
custom defined type (which simply extends Writable) that contains several
aggregate metrics. I want vertices to be able to do something along the
lines of

aggregate(MY_MAP_AGGREGATOR, new MyAggregatorMessage(id, stat1, stat2));

and then the map aggregator will do something like

public void aggregate(MyAggregatorMessage m) {

    MapWritable currentMap = (MapWritable) getAggregatedValue();

    if (!currentMap.containsKey(m.getId())) {
        // MyAggregatorData contains the aggregate info I want to keep for
        // each id. Contains init. values for stat1 and stat2
        currentMap.put(m.getId(), new MyAggregatorData());
    }

    // MapWritable stores Writables, so a cast is needed here
    MyAggregatorData oldData = (MyAggregatorData) currentMap.get(m.getId());
    // Performs appropriate aggregates for each stat and stores it. Sum,
    // average, whatever
    oldData.aggregate(m.getStat1(), m.getStat2());
}

However, the problem is that the method signatures
<https://giraph.apache.org/apidocs/org/apache/giraph/aggregators/Aggregator.html>
for Aggregator all have to use the same type. In other words, I can't have

public MapWritable getAggregatedValue()

and

public void aggregate (MyAggregatorMessage m)

because the types are different.

My idea right now is to use a MyAggregatorWritable class that extends
GenericWritable
<http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/GenericWritable.html>
to wrap both MyAggregatorMessage and MapWritable and then use that as the
method signature for both, and deal with the rest through casting. I've
already used GenericWritable for something else so the implementation would
be straightforward.
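
Concretely, I'm imagining something like this (untested; it's just the
standard GenericWritable pattern, and assumes MyAggregatorMessage is itself
a Writable):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

// One declared type that can carry either the aggregate input
// (MyAggregatorMessage) or the aggregated value (MapWritable).
public class MyAggregatorWritable extends GenericWritable {
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES =
            new Class[] { MapWritable.class, MyAggregatorMessage.class };

    public MyAggregatorWritable() {
        // No-arg constructor needed for deserialization.
    }

    public MyAggregatorWritable(Writable instance) {
        set(instance);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}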

So, I have a few principal questions, I suppose:

1) Is there a better way to implement this than to use a GenericWritable as
described above? If any of you have code for your own way to do this, I'd
love to see it, and if not, I'd love to contribute what I come up with as a
MapAggregator (in a generic manner) to the Giraph project if that would be
appropriate.

2) Is there anything wrong in principle with this type of solution? In
other words, is there some kind of philosophical or design reason that
having a Map as an aggregator is a bad idea? I know that it might not end
up being very efficient, but as it stands, I'm not seeing any other
solution to my problem; if there's an ordinary kind of workaround that
would be more efficient I'd love to hear it.

3) [Less important and more discussion oriented] Why is the API designed
such that these methods must use the same type? It seems like having an
Aggregator with separate input and output types would be useful.

I apologize for the quite long message, and I appreciate any help you can
offer. If you need any other information, please let me know and I'll be
happy to provide it. In trying to simplify everything I easily could have
made a mistake or left out something important.  Thanks in advance.

Best,
Matthew
http://www.matthewsaltz.com