RE: Best way to know the assignment of vertices to workers

2014-11-28 Thread Pavan Kumar A
I wrote a diff some time ago where you can easily do that.
You can find implementation details at:
https://issues.apache.org/jira/browse/GIRAPH-908
https://reviews.apache.org/r/22234/
Some options you can use are:
# Mapping store related information
-Dgiraph.mappingStoreClass=org.apache.giraph.mapping.LongByteMappingStore
-Dgiraph.lbMappingStoreUpper=1987000
-Dgiraph.lbMappingStoreLower=4096
# Mapping store ops information
-Dgiraph.mappingStoreOpsClass=org.apache.giraph.mapping.DefaultEmbeddedLongByteOps
# Embed mapping information
-Dgiraph.edgeTranslationClass=org.apache.giraph.mapping.translate.LongByteTranslateEdge
# PartitionerFactory to be used
-Dgiraph.graphPartitionerFactoryClass=org.apache.giraph.partition.LongMappingStorePartitionerFactory
And like vertex input & edge input, we now have a mapping input. I only
implemented all these for giraph-hive, so if you have a hive table with the
mapping vertexId -> workerNum, then you can pass the mapping input like:
org.apache.giraph.hive.input.mapping.examples.LongInt2ByteHiveToMapping,
$mapping_table, $mapping_partition
You can go through the code for each of these options to see what they do.
Using this you can, in a sense, pre-assign vertex ids to workers: if you assign
two vertices to a worker, say worker-1, it is guaranteed they are both present
on the same worker. The numbering (aka identification/naming) of workers is
consistent (i.e., if a and b are assigned worker-x, they are guaranteed to be on
the same worker, but we do not know ahead of time which physical worker that
will be), but it cannot be explicitly set by the user (which, from what I can
tell, is what you want to do).
If you are using something other than hive, then you will have to implement all
the interfaces of MappingInputFormat, and then you can easily achieve what you
want.
From: kiran.garime...@aalto.fi
To: user@giraph.apache.org
Subject: Best way to know the assignment of vertices to workers
Date: Fri, 28 Nov 2014 12:02:59 +






Hi all,



Is there a clean way to find out which worker a particular vertex is assigned 
to?



From what I tried out, I found that given n workers, each node is assigned to
the worker with id (vertex_id % n). Is that a safe way to do this?
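For reference, Giraph's default assignment is hash-based, so vertex_id % n coincides with it only when the id type's hashCode equals the raw id. A minimal pure-Java sketch of that idea (an assumption about the default partitioner, not Giraph's actual code):

```java
// Sketch (not Giraph's actual partitioner): hash-based assignment of a
// long vertex id to one of n partitions.
public class HashAssignment {
    public static int partitionOf(long vertexId, int numPartitions) {
        // Long.hashCode(id) == (int) (id ^ (id >>> 32)), so for small
        // non-negative ids this degenerates to vertexId % numPartitions.
        return Math.abs(Long.hashCode(vertexId) % numPartitions);
    }

    public static void main(String[] args) {
        for (long id = 0; id < 6; id++) {
            System.out.println(id + " -> partition " + partitionOf(id, 4));
        }
    }
}
```

Note that partitions are further mapped to workers, so even a stable partition number does not by itself pin a vertex to a particular physical worker.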




I’ve had a look at previous discussions, but most of them have no answer.




—



Why I need it:



In my application, each vertex needs to know some additional metadata, which is
loaded from a file. This metadata file is huge (50 GB), so on each worker I
only want to load the metadata corresponding to the vertices present on that
worker.



—






Previous discussions:
1. 
http://mail-archives.apache.org/mod_mbox/giraph-user/201310.mbox/%3C7EC16F82718A6D4A920A99FE46CE7F4E2861F779%40MERCMBX19R.na.SAS.com%3E
2. 
http://mail-archives.apache.org/mod_mbox/giraph-user/201403.mbox/%3CCAMf08QYE%2BRgUv9otXT6oPJorTNjQ-Ay8p4NUiuhds8%2BzgDzs1w%40mail.gmail.com%3E









Regards,
Kiran 

RE: Best way to know the assignment of vertices to workers

2014-11-28 Thread Pavan Kumar A
I looked at the code again & it does not seem like workerList is sorted, etc., so
by knowing a worker number there is no consistent way to tell the actual worker
details each time. Lukas was working on such a diff some time back. Perhaps he
can answer more.

RE: Graph partitioning and data locality

2014-11-04 Thread Pavan Kumar A
You can also look at https://issues.apache.org/jira/browse/GIRAPH-908 which
solves the case where you have a partition map and would like the graph to be
partitioned that way after loading the input. It does not, however, solve the
"do not shuffle the data" part.

From: claudio.marte...@gmail.com
Date: Tue, 4 Nov 2014 16:20:21 +0100
Subject: Re: Graph partitioning and data locality
To: user@giraph.apache.org

Hi,
answers are inline.
On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns martin.jungha...@gmx.net 
wrote:
Hi group,



I got a question concerning the graph partitioning step. If I understood the
code correctly, the graph is distributed to n partitions by using
vertexID.hashCode() % n. I got two questions concerning that step.



1) Is the whole graph loaded and partitioned only by the Master? This would
mean the whole data set has to be moved to that master map task and then moved
to the physical node the specific worker for the partition runs on. As this
sounds like a huge overhead, I further inspected the code:

I saw that there is also a WorkerGraphPartitioner, and I assume it calls the
partitioning method on its local data (let's say its local HDFS blocks); if
the resulting partition for a vertex is not local, the data gets moved to that
worker, which reduces the overhead. Is this assumption correct?

That is correct, workers forward vertex data to the correct worker who is 
responsible for that vertex via hash-partitioning (by default), meaning that 
the master is not involved. 


2) Let's say the graph is already partitioned in the file system, e.g. blocks
on physical nodes contain logically connected graph nodes. Is it possible to
just read the data as it is and skip the partitioning step? In that case I
currently assume that the vertexID should contain the partitionID and the
custom partitioning would be an identity function (instead of hashing or
range).

In principle you can. You would need to organize splits so that they contain 
all the data for each particular worker, and then assign relevant splits to the 
corresponding worker. 


Thanks for your time and help!



Cheers,

Martin



-- 
Claudio Martella
   
  

RE: Using a custom graph partitioning stratergy with giraph

2014-10-01 Thread Pavan Kumar A
I will write a detailed explanation over the weekend. Thanks for your interest.

Date: Wed, 1 Oct 2014 10:56:16 -0700
Subject: Re: Using a custom graph partitioning stratergy with giraph
From: charith.dhanus...@gmail.com
To: user@giraph.apache.org

Thanks Pavan, 
I get the high-level idea. I am still new to the Giraph code base, so I am
still trying to understand the overall design.
So I have a few questions regarding this feature.
Can we use this feature with a vertex input format without using edge
translation? (Since getPartition in MappingStoreOps can be used to get the
partition of any target vertex.)

Also, since I have the mapping information in a separate file, do I need to
embed target information in the vertex?
It would be great if you could explain your scenario with the data format you
used and what extension points you used, so that I can understand it better and
adapt it to my scenario.

Thanks,
Charith



On Mon, Sep 29, 2014 at 3:34 PM, Pavan Kumar A pava...@outlook.com wrote:



we have two inputs - vertex & edges. If we partition edges & vertices based on
a map, then when we want to send messages we should be able to know which
partition a vertex is on.
Typically we send messages to the targetIds of outgoing edges; edge translation
helps encode mapping information into targetIds, so knowing which partition to
send a message to can be done by just looking at the targetId.
Date: Mon, 29 Sep 2014 14:37:22 -0700
Subject: Re: Using a custom graph partitioning stratergy with giraph
From: charith.dhanus...@gmail.com
To: user@giraph.apache.org

Hi Pavan,
Thanks for the details. I went through the code, especially the extension points
you mentioned. I am not clear about the function of the edge translation
(org.apache.giraph.mapping.translate.TranslateEdge) class. Could you please
explain the idea of this translation process?

In my case I will have a mapping file which maps each vertex to a partition, e.g.:
v1 part1
v2 part2
v3 part3
...
So I was thinking of passing this as a parameter and reading it inside my own
MappingStore implementation:
-Dgiraph.mappingFilePath=/user/charith/input/mapping.txt
Is there a better approach?
Thanks,
Charith







On Sun, Sep 28, 2014 at 8:29 AM, Pavan Kumar A pava...@outlook.com wrote:



I worked on this feature some time back - but I only worked on inputting a hive
file & not hdfs.
You can use logic outside giraph to select which partition file to use - this
is possible because you input the number of workers anyway. For instance, in
the script that you use to launch a giraph job, have selection logic for the
partition file.
You can take a look at: https://issues.apache.org/jira/browse/GIRAPH-908
You might have to extend upon the jira for your specific use case - I only
added support for the case when id = LongWritable.
Here is a list of options you might want to explore:
# Mapping store related information
-Dgiraph.mappingStoreClass=org.apache.giraph.mapping.LongByteMappingStore
-Dgiraph.lbMappingStoreUpper=1987000
-Dgiraph.lbMappingStoreLower=4096
# Mapping store ops information
-Dgiraph.mappingStoreOpsClass=org.apache.giraph.mapping.DefaultEmbeddedLongByteOps
# Embed mapping information
-Dgiraph.edgeTranslationClass=org.apache.giraph.mapping.translate.LongByteTranslateEdge
# PartitionerFactory to be used
-Dgiraph.graphPartitionerFactoryClass=org.apache.giraph.partition.LongMappingStorePartitionerFactory
So the partition map is stored here as a map of byte arrays, with
lbMappingStoreUpper being the size of the map and lbMappingStoreLower being the
size of the individual arrays.
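One plausible reading of that layout, as a hypothetical pure-Java sketch (the class and method names below are illustrative, not the actual LongByteMappingStore API): "upper" caps the id space and "lower" is the size of each inner byte array:

```java
// Hypothetical sketch of a long -> byte mapping store laid out as chunks
// of byte arrays; "upper" caps the id space, "lower" is the chunk size.
public class TwoLevelByteStore {
    private final byte[][] chunks;
    private final int lower;

    public TwoLevelByteStore(int upper, int lower) {
        this.lower = lower;
        this.chunks = new byte[(upper + lower - 1) / lower][]; // ceil(upper/lower)
    }

    public void put(long id, byte worker) {
        int outer = (int) (id / lower), inner = (int) (id % lower);
        if (chunks[outer] == null) {
            chunks[outer] = new byte[lower];
            java.util.Arrays.fill(chunks[outer], (byte) -1); // -1 = unmapped
        }
        chunks[outer][inner] = worker;
    }

    public byte get(long id) {
        byte[] chunk = chunks[(int) (id / lower)];
        return chunk == null ? -1 : chunk[(int) (id % lower)];
    }

    public static void main(String[] args) {
        TwoLevelByteStore store = new TwoLevelByteStore(1987000, 4096);
        store.put(1234567L, (byte) 3);
        System.out.println(store.get(1234567L)); // 3
        System.out.println(store.get(42L));      // -1 (unmapped)
    }
}
```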
Please explore the code & tell me what else you need.
Thanks

Date: Sat, 27 Sep 2014 22:51:29 -0700
Subject: Re: Using a custom graph partitioning stratergy with giraph
From: charith.dhanus...@gmail.com
To: user@giraph.apache.org

Also adding some more information.
My current understanding is that I should be able to do this with my own
org.apache.giraph.partition.WorkerGraphPartitioner implementation.
But my question is: is there a way to get some outside input inside the
WorkerGraphPartitioner? In my case it will be an hdfs file location.

Thanks,
Charith







 
On Sat, Sep 27, 2014 at 10:13 PM, Charith Wickramarachchi 
charith.dhanus...@gmail.com wrote:
Hi,
I'm trying to use giraph with a custom graph partitioner that I have. In my
case I want to assign vertices to workers based on a custom partitioner input.
The partitioner will take the number of workers as an input parameter and give
me a file which maps each vertex id to a worker. I'm trying to load this file
to an hdfs location and use it as an input to giraph to do the vertex
assignment.
Any suggestions or pointers on the best way to do this will be highly
appreciated (use the current extension points of giraph as much as possible to
avoid random hacks).
I'm currently using giraph-1.0.0.
Thanks,
Charith



-- 
Charith Dhanushka Wickramaarachchi
Tel: +1 213 447 4253 | Web: http://apache.org/~charith | Blog:
http://charith.wickramaarachchi.org/ | Twitter: @charithwiki

RE: Graph re-partitioning

2014-09-29 Thread Pavan Kumar A
If you are using hash partitioning, then as long as the number of workers is
the same, the partitions will remain unchanged, though they might run on a
different worker. However, yes, the graph always goes through partitioning.
Date: Mon, 29 Sep 2014 15:01:37 -0400
Subject: Graph re-partitioning
From: xuhongne...@gmail.com
To: user@giraph.apache.org

Hello,
Will Giraph re-partition the graph each time it runs a job on this graph?
Is there any way to directly load the partitioned graph from the last job?
Thanks
-- 
Xuhong Zhang
  

RE: Using a custom graph partitioning stratergy with giraph

2014-09-29 Thread Pavan Kumar A
we have two inputs - vertex & edges. If we partition edges & vertices based on
a map, then when we want to send messages we should be able to know which
partition a vertex is on.
Typically we send messages to the targetIds of outgoing edges; edge translation
helps encode mapping information into targetIds, so knowing which partition to
send a message to can be done by just looking at the targetId.
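The thread does not show the actual encoding; as a hypothetical illustration of "encode mapping information into targetIds" (not Giraph's real LongByteTranslateEdge scheme), a worker byte could be packed into the high 8 bits of a long target id:

```java
// Hypothetical illustration: pack a worker number into the high byte of a
// long vertex id, so the destination can be read straight off the targetId.
public class EdgeTranslation {
    static final long ID_MASK = 0x00FFFFFFFFFFFFFFL; // low 56 bits keep the raw id

    public static long translate(long rawId, int worker) {
        return (rawId & ID_MASK) | ((long) (worker & 0xFF) << 56);
    }

    public static int workerOf(long translatedId) {
        return (int) (translatedId >>> 56); // high byte = worker number
    }

    public static long rawIdOf(long translatedId) {
        return translatedId & ID_MASK;
    }

    public static void main(String[] args) {
        long t = translate(123456789L, 7);
        System.out.println(workerOf(t));  // 7
        System.out.println(rawIdOf(t));   // 123456789
    }
}
```

This only works if raw ids fit in 56 bits; the real translation layer is driven by the mapping store instead of a fixed bit layout.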

RE: Using a custom graph partitioning stratergy with giraph

2014-09-28 Thread Pavan Kumar A
I worked on this feature some time back - but I only worked on inputting a hive
file & not hdfs.
You can use logic outside giraph to select which partition file to use - this
is possible because you input the number of workers anyway. For instance, in
the script that you use to launch a giraph job, have selection logic for the
partition file.
You can take a look at: https://issues.apache.org/jira/browse/GIRAPH-908
You might have to extend upon the jira for your specific use case - I only
added support for the case when id = LongWritable.
Here is a list of options you might want to explore:
# Mapping store related information
-Dgiraph.mappingStoreClass=org.apache.giraph.mapping.LongByteMappingStore
-Dgiraph.lbMappingStoreUpper=1987000
-Dgiraph.lbMappingStoreLower=4096
# Mapping store ops information
-Dgiraph.mappingStoreOpsClass=org.apache.giraph.mapping.DefaultEmbeddedLongByteOps
# Embed mapping information
-Dgiraph.edgeTranslationClass=org.apache.giraph.mapping.translate.LongByteTranslateEdge
# PartitionerFactory to be used
-Dgiraph.graphPartitionerFactoryClass=org.apache.giraph.partition.LongMappingStorePartitionerFactory
So the partition map is stored here as a map of byte arrays, with
lbMappingStoreUpper being the size of the map and lbMappingStoreLower being the
size of the individual arrays.
Please explore the code & tell me what else you need.
Thanks

RE: receiving messages that I didn't send

2014-09-23 Thread Pavan Kumar A
Can you give more context? What are the types of messages, a patch of your
compute method, etc.? You will not receive messages that are not sent, but one
thing that can happen is this: a message can have multiple parameters. Suppose
message objects can have 2 parameters, a and b. Say in m's write(out) you do
not handle the case of b = null: m1 sets b and m2 has b = null; then, because
of the incorrect code in m's write(), m2 can show b = m1.b. That is because
message objects are re-used when receiving. This is a Giraph gotcha, because of
object reuse in most iterators.
Thanks
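The gotcha described above can be reproduced with plain java.io streams (the Msg class below is a stand-in, not a real Giraph message type): write() records the optional field only when present, and a readFields() that forgets to reset the field leaks the previous message's value once the object is reused:

```java
import java.io.*;

public class ReuseGotcha {
    static class Msg {
        long a;
        String b; // optional field, may be null

        void write(DataOutput out) throws IOException {
            out.writeLong(a);
            out.writeBoolean(b != null);
            if (b != null) out.writeUTF(b);
        }

        // BUG: missing "else b = null;" -- on a reused object, a message
        // with b == null appears to carry the previous message's b.
        void readFieldsBuggy(DataInput in) throws IOException {
            a = in.readLong();
            if (in.readBoolean()) b = in.readUTF();
        }
    }

    public static String demo() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            Msg m1 = new Msg(); m1.a = 1; m1.b = "hello"; m1.write(out);
            Msg m2 = new Msg(); m2.a = 2; m2.b = null;    m2.write(out);

            DataInput in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            Msg reused = new Msg();     // Giraph-style object reuse
            reused.readFieldsBuggy(in); // reads m1
            reused.readFieldsBuggy(in); // reads m2, but b stays stale
            return reused.a + ":" + reused.b;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2:hello -- m2 never sent "hello"
    }
}
```

Adding `else b = null;` in readFields makes the second read return a null b, which is the fix the answer above describes.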

 From: m...@matthewcornell.org
 Date: Tue, 23 Sep 2014 10:10:48 -0400
 Subject: receiving messages that I didn't send
 To: user@giraph.apache.org
 
 Hi Folks. I am refactoring my compute() to use a set of ids as its
 message type, and in my tests it is receiving a message that it
 absolutely did not send. I've debugged it and am at a loss.
 Interestingly, I encountered this once before and solved it by
 creating a copy of a Writable instead of re-using it, but I haven't
 been able to solve it this time. In general, does this anomalous
 behavior indicate a Giraph/Hadoop gotcha? It's really confounding!
 Thanks very much -- matt
 
 -- 
 Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
 Dickinson Street, Amherst MA 01002 | matthewcornell.org
  

RE: NegativeArraySizeException with large dataset

2014-09-09 Thread Pavan Kumar A

Yes, you should implement your own edge store. Please take a look at
ByteArrayEdges for an example and modify it to use BigDataOutput & BigDataInput
instead of ExtendedByteArrayOutput/Input.
From: and...@wizardapps.net
To: user@giraph.apache.org
Subject: Re: NegativeArraySizeException with large dataset
Date: Tue, 9 Sep 2014 09:58:36 -0700






Great, thanks for pointing me in the right direction. All of the edge values 
are strings (in a Text object) and point to and from vertices with Text IDs, 
but none of the values should be greater than 60 bytes or so during the loading 
step. The size will increase during computation because I am modifying the 
values of the edges, but the actual size of the data is not too large.

 
Given that I am using text based IDs and values, it looks to me like I may have 
to implement my own edge store-- does that seem right?

 
Thank you for your help!
 
-- 

Andrew

 

 
 
On Mon, Sep 8, 2014, at 05:31 PM, Pavan Kumar A wrote:

ByteArrayEdges, like the other edge stores, uses array-based / map-based
storage; all of these will encounter this exception when the size of the array
approaches Integer.MAX_VALUE.

Some things to consider for the time being: what do your edges look like?

If they are long ids & null values, you can use LongNullArrayEdges to push the
boundary a bit, i.e., until you get a vertex which has ~2 billion outgoing edges.

For long ids & double values, you can use LongDoubleArrayEdges, etc.

 
please take a look at classes that implement this interface OutEdges

 
If none of those work, you can implement one of your own

and use a store backed by data structures like BigDataOutput instead of plain
old ByteArrays

 
From: and...@wizardapps.net

To: user@giraph.apache.org

Subject: NegativeArraySizeException with large dataset

Date: Mon, 8 Sep 2014 17:19:17 -0700

 
Hey,

 
I am currently running Giraph on a semi-large dataset of 600 million edges (the 
edges are directed, so I've used the ReverseEdgeDuplicator for an expected 
total of 1.2b edges). I am running into an issue during superstep -1 when the 
edges are being loaded-- I receive a java.lang.NegativeArraySizeException 
exception. This occurs near the end of when the edges should be done loading-- 
by my estimate, I believe around 1b out of the 1.2b have been loaded.

 
The exception occurs on one of the workers, and all of the other workers 
subsequently halt loading before I kill the job.

 
The issue doesn't occur with half of the dataset (300 million edges, 600 
million total with the reverser).

 
The only reference I've found to this particular exception type is GIRAPH-821 
(https://issues.apache.org/jira/browse/GIRAPH-821), which suggests to enable 
the useBigDataIOForMessages flag. I would be surprised if it helped, because 
this error occurs during the loading superstep, and there are no super 
vertices in my traversal computation. Enabling this flag had no effect.

 
Any help on this would be appreciated.

 
The full stack trace for the exception is as follows:

 
java.lang.NegativeArraySizeException

at 
org.apache.giraph.utils.UnsafeByteArrayOutputStream.ensureSize(UnsafeByteArrayOutputStream.java:116)

at 
org.apache.giraph.utils.UnsafeByteArrayOutputStream.write(UnsafeByteArrayOutputStream.java:167)

at org.apache.hadoop.io.Text.write(Text.java:282)

at 
org.apache.giraph.utils.WritableUtils.writeEdge(WritableUtils.java:501)

at org.apache.giraph.edge.ByteArrayEdges.add(ByteArrayEdges.java:93)

at 
org.apache.giraph.edge.AbstractEdgeStore.addPartitionEdges(AbstractEdgeStore.java:166)

at 
org.apache.giraph.comm.requests.SendWorkerEdgesRequest.doRequest(SendWorkerEdgesRequest.java:72)

at 
org.apache.giraph.comm.netty.handler.WorkerRequestServerHandler.processRequest(WorkerRequestServerHandler.java:62)

at 
org.apache.giraph.comm.netty.handler.WorkerRequestServerHandler.processRequest(WorkerRequestServerHandler.java:36)

at 
org.apache.giraph.comm.netty.handler.RequestServerHandler.channelRead(RequestServerHandler.java:108)

at 
io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)

at 
io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)

at 
org.apache.giraph.comm.netty.handler.RequestDecoder.channelRead(RequestDecoder.java:100)

at 
io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)

at 
io.netty.channel.DefaultChannelHandlerContext.access$700(DefaultChannelHandlerContext.java:29)

at 
io.netty.channel.DefaultChannelHandlerContext$8.run(DefaultChannelHandlerContext.java:329)

at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:354)

at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:353)

at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run

RE: NegativeArraySizeException with large dataset

2014-09-08 Thread Pavan Kumar A
ByteArrayEdges, like the other edge stores, uses array-based / map-based
storage; all of these will encounter this exception when the size of the array
approaches Integer.MAX_VALUE.
Some things to consider for the time being: what do your edges look like?
If they are long ids & null values, you can use LongNullArrayEdges to push the
boundary a bit, i.e., until you get a vertex which has ~2 billion outgoing
edges; for long ids & double values, you can use LongDoubleArrayEdges, etc.
Please take a look at the classes that implement the OutEdges interface.
If none of those work, you can implement one of your own and use a store backed
by data structures like BigDataOutput instead of plain old ByteArrays.
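The underlying failure is ordinary int overflow during capacity growth. A generic sketch (not the actual UnsafeByteArrayOutputStream code):

```java
// Generic sketch of why byte-array-backed stores break near 2 GB:
// doubling an int capacity past Integer.MAX_VALUE wraps negative,
// and allocating an array of negative size throws.
public class GrowthOverflow {
    static int doubledCapacity(int capacity) {
        return capacity * 2; // overflows once capacity >= 2^30
    }

    public static void main(String[] args) {
        int capacity = 1 << 30;                 // 1,073,741,824
        int doubled = doubledCapacity(capacity);
        System.out.println(doubled);            // -2147483648
        try {
            byte[] grown = new byte[doubled];   // never allocates
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException");
        }
    }
}
```

Stores built on BigDataOutput-style chunked structures avoid the single ~2 GB array and therefore this overflow.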


RE: n-ary relationship on Giraph

2014-05-21 Thread Pavan Kumar A

Can you please provide more context?
vertex - edge (the edge value can store any properties required of that edge) - 
vertex (the vertex value can store any property required for the vertex)
Date: Wed, 21 May 2014 13:50:34 -0700
From: sujanu...@yahoo.com
Subject: n-ary relationship on Giraph
To: user@giraph.apache.org

Hi,
Does Giraph support n-ary relationships? I need to store some properties of 
the triplet vertex - edge - vertex and be able to query with those properties.
Sujan Perera

RE: n-ary relationship on Giraph

2014-05-21 Thread Pavan Kumar A
The state of the triplet A - C - B can be stored in the edge value for C (the edge 
from A to B). I would like to remind you that Giraph is a batch processing 
framework, not a graph database. You can do complex graph processing on the 
input graph, and such questions can be answered very trivially, but performance 
need not be great. You must write Java code and run a map-reduce job.
For this case your compute function consists of just one superstep, which filters 
edges for a vertex based on the criterion; then you can write the output 
back to one of the supported storage formats.
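To make the one-superstep idea concrete, here is a minimal self-contained sketch of the filtering logic such a compute function would perform, written in plain Java rather than against the Giraph Computation API (EdgeRecord and the epoch-day date encoding are illustrative assumptions, not Giraph types):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-vertex filtering a single-superstep Giraph compute
// would perform: keep only edges whose "date created" value precedes a
// cutoff, then write the surviving triplets to the output format.
public class TripletFilter {
    // One stored triplet A -(C)-> B: target id plus the edge property.
    public static final class EdgeRecord {
        final long targetId;
        final long dateCreatedEpochDay;
        public EdgeRecord(long targetId, long dateCreatedEpochDay) {
            this.targetId = targetId;
            this.dateCreatedEpochDay = dateCreatedEpochDay;
        }
    }

    // Returns target ids of edges created strictly before the cutoff.
    public static List<Long> createdBefore(List<EdgeRecord> edges, long cutoff) {
        List<Long> result = new ArrayList<>();
        for (EdgeRecord e : edges) {
            if (e.dateCreatedEpochDay < cutoff) {
                result.add(e.targetId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<EdgeRecord> edges = new ArrayList<>();
        edges.add(new EdgeRecord(2L, 100L));
        edges.add(new EdgeRecord(3L, 200L));
        System.out.println(createdBefore(edges, 150L)); // [2]
    }
}
```

In an actual job this loop would live inside compute(), iterating vertex.getEdges() and removing or emitting edges based on the criterion.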

Date: Wed, 21 May 2014 16:32:44 -0700
From: sujanu...@yahoo.com
Subject: Re: n-ary relationship on Giraph
To: user@giraph.apache.org

Let's say I have node A and node B, linked with edge C. Now I have properties 
which belong to this A - C - B triplet. For example, I have a property 'date 
created'; it belongs to A - C - B. Can I represent this in Giraph? Also, does 
Giraph have a querying mechanism, so that I can retrieve triplets which were 
created before a particular date?
Sujan Perera
 On Wednesday, May 21, 2014 3:51 PM, Pavan Kumar A pava...@outlook.com 
wrote:



  

RE: input superstep of giraph.

2014-04-18 Thread Pavan Kumar A
Btw, the jobs we run typically run for hours, so total time is mostly just the sum 
of input + supersteps for us, since the little extra time is negligible. 
However, I see your job itself is quite small, so to be more accurate: total time = 
time between the start of your job (once all machines were allocated) and the end 
of your job (when you can see in the logs that all workers are done with the last 
superstep)

From: pava...@outlook.com
To: user@giraph.apache.org
Subject: RE: input superstep of giraph.
Date: Fri, 18 Apr 2014 23:58:53 +0530




Please take a look at GIRAPH-838. Note that there is a little window between the end 
of one superstep & the start of the next, so this 120 s can be accounted for 
by that. But what I meant was that total time is as good as the sum of input + other 
supersteps (though only approximately, because of this slight extra time)

Date: Fri, 18 Apr 2014 16:28:48 +0100
Subject: Re: input superstep of giraph.
From: ghufran1ma...@gmail.com
To: user@giraph.apache.org

Hi Pavan, 
I might have misunderstood your explanation. But from the giraph timers I 
received, Total does not seem to be:

 sum of input time + sum of time in all supersteps



For example the following timers were outputted after I ran the 
ConnectedComponents algorithm: 

Giraph Timers
    Initialize (ms)=424
    Input superstep (ms)=1457
    Setup (ms)=85
    Shutdown (ms)=11666
    Superstep 0 ConnectedComponentsComputation (ms)=903
    Superstep 1 ConnectedComponentsComputation (ms)=4565
    Superstep 10 ConnectedComponentsComputation (ms)=475
    Superstep 11 ConnectedComponentsComputation (ms)=454
    Superstep 12 ConnectedComponentsComputation (ms)=342
    Superstep 2 ConnectedComponentsComputation (ms)=3094
    Superstep 3 ConnectedComponentsComputation (ms)=1399
    Superstep 4 ConnectedComponentsComputation (ms)=783
    Superstep 5 ConnectedComponentsComputation (ms)=591
    Superstep 6 ConnectedComponentsComputation (ms)=458
    Superstep 7 ConnectedComponentsComputation (ms)=458
    Superstep 8 ConnectedComponentsComputation (ms)=483
    Superstep 9 ConnectedComponentsComputation (ms)=458
    Total (ms)=27675
Input superstep = sum of input time?
Input superstep + sum of supersteps = 1457 + 14463
                                    = 15920

and the total is 27675,
so there are still 11755 ms unaccounted for?
Or have I misunderstood what the sum of input time should be?

Kind regards, 

Ghufran 
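The arithmetic in the question can be checked mechanically; a small plain-Java sketch using the counter values from the run above:

```java
// Recomputes the gap described in the thread, using the values printed
// by Giraph Timers for this ConnectedComponents run.
public class TimerGap {
    // Time not covered by the input superstep plus the compute supersteps.
    public static long unaccounted(long total, long inputSuperstep, long[] supersteps) {
        long sum = inputSuperstep;
        for (long ms : supersteps) {
            sum += ms;
        }
        return total - sum;
    }

    public static void main(String[] args) {
        // Supersteps 0..12 from the posted counters.
        long[] supersteps = {903, 4565, 3094, 1399, 783, 591, 458,
                             458, 483, 458, 475, 454, 342};
        System.out.println(unaccounted(27675, 1457, supersteps)); // 11755
    }
}
```

The 11755 ms gap matches the Shutdown (11666 ms) plus Initialize/Setup counters almost exactly, which is the point made in the replies.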




On Fri, Apr 18, 2014 at 3:52 PM, ghufran malik ghufran1ma...@gmail.com wrote:

Hi, 
Thank you for the explanation :) 
It was confusing when reading it; some of the timers I can intuitively 
understand. However, I think it would be beneficial if these explanations were 
added to the API docs, so that if anyone else is confused they can look up the 
meanings there.



https://giraph.apache.org/giraph-core/apidocs/org/apache/giraph/counters/GiraphTimers.html
 


Thanks, 
Ghufran

On Fri, Apr 18, 2014 at 3:25 PM, Pavan Kumar A pava...@outlook.com wrote:





I wrote the Initialize counter :) Please tell me if the name seems confusing.
So, Initialize = the time spent by the job waiting for resources. In a shared pool 
the job you launch may not get all the machines needed to start. So for 
instance, if you want to run a job with 200 workers, Giraph does not start until 
all the workers have been allocated & register with the master.

Setup = once you have all the machines allocated, how much time it takes before 
starting to read input.
Shutdown = once you have written your output, how much time it takes to stop: 
verify that everything is done & shut down resources & notify the user - for 
instance wait for all network connections to close, all threads to join, etc.

Total = sum of input time + sum of time in all supersteps, i.e., the actual time 
taken by your application after it got all the resources (it does not 
include time waiting to get resources, which is initialize, or shutdown time)



Date: Fri, 18 Apr 2014 13:28:47 +0100
Subject: Re: input superstep of giraph.
From: ghufran1ma...@gmail.com
To: user@giraph.apache.org



Hi, 

Could you also explain what the following timers correspond to as well please: 

Giraph Timers
    Initialize (ms)=775
    Setup (ms)=105
    Shutdown (ms)=12537
    Total (ms)=27075
Thanks, 

Ghufran



On Thu, Apr 17, 2014 at 9:10 PM, Pavan Kumar A pava...@outlook.com wrote:






Input consists of:
- reading the input (vertices and/or edges as provided) into memory on individual workers
- assigning vertices to partitions and partitions to workers
- moving all partitions (i.e., vertices & their out-edges) to the worker which owns the partition
- doing some bookkeeping of internal data structures to be used during computation

Date: Thu, 17 Apr

RE: input superstep of giraph.

2014-04-17 Thread Pavan Kumar A
Input consists of:
- reading the input (vertices and/or edges as provided) into memory on individual workers
- assigning vertices to partitions and partitions to workers
- moving all partitions (i.e., vertices & their out-edges) to the worker which owns the partition
- doing some bookkeeping of internal data structures to be used during computation

Date: Thu, 17 Apr 2014 10:06:03 -0500
Subject: input superstep of giraph.
From: suijian.z...@gmail.com
To: user@giraph.apache.org

Hi, 
  From the screen output of a successful giraph program run, what does the 
following line mean?

Input superstep (ms)=22884


 Does it mean the time used to load the input graph into memory? Thanks.

  Best Regards,
  Suijian



  

RE: Giraph Buffer Size

2014-04-16 Thread Pavan Kumar A

What do you mean by buffer size? Just as a note, please ensure that Xmx & Xms 
values are properly set for the mapper using mapred.child.java.opts or 
mapred.map.child.java.opts. Also, what does the error message show? Please use 
pastebin & post the link here.
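For reference, the mapper heap settings mentioned can be passed on the job command line; an illustrative sketch (the 4g values are placeholders, not recommendations):

```
-Dmapred.child.java.opts="-Xms4g -Xmx4g"
```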
Date: Wed, 16 Apr 2014 12:13:29 +0530
Subject: Giraph Buffer Size
From: agrta.ra...@gmail.com
To: user@giraph.apache.org

Hi All,

I am trying to run a job in Giraph-1.0.0 on Hadoop-1.0.0 cluster with 3 nodes.
Each node has 32gb RAM.

In superstep 8 of my algorithm approximately 2M messages are being sent, where 
the size of each message is more than 20 kb. But the process gets stuck here and 
the task fails.


In the sysout logs, it shows Fatal Error.

Is this error because Buffer is getting full?
How can I increase the buffer size for giraph application?

Please suggest.


Regards,
Agrta Rawat

  

RE: Can a vertex belong to more than one partition

2014-04-16 Thread Pavan Kumar A
Isn't graph isomorphism NP-hard in general? I guess you already know that 
partitioning in Giraph does not mean actual graph partitioning / graph 
clustering -- for instance http://dl.acm.org/citation.cfm?id=2433461. It is 
just a concept of splitting the vertices into different compute segments for 
parallel & distributed processing of the graph.
Anyway, there are multiple ways in which you can make the query graph available to 
all vertices of the big graph. For instance, you can have a different input format 
defined for your query graph file that will read it as one vertex with a fixed 
id - say QueryGraph - and any vertex that needs access to the graph can send it 
a message and receive the whole graph as a response.
However, your requirements remind me of the Giraph++ work: 
http://researcher.watson.ibm.com/researcher/files/us-ytian/giraph++.pdf. This is 
not supported in Giraph yet. I guess you want to do some computation on the 
whole partition, like comparing the query graph with an entire partition of the 
big graph, which is not so easy to do with the current API.
I might have misunderstood; please correct me if wrong. Thanks
 Date: Wed, 16 Apr 2014 09:55:18 +0530
 Subject: Re: Can a vertex belong to more than one partition
 From: trivedi.aksh...@gmail.com
 To: user@giraph.apache.org
 
 Hi,
 I am solving graph isomorphism between a large graph and a query graph.
 The large graph is partitioned and so the query graph should be
 available to all partitions. Apart from this, some of the large graph
 vertices(such as those which have edges between partitions) also have
 to be duplicated.
 
 On Mon, Apr 7, 2014 at 9:53 PM, Pavan Kumar A pava...@outlook.com wrote:
  If you want the vertex.value to be available to all vertices, then you can
  store it in an aggregator.
  A vertex can belong to exactly one partition. But please answer Lukas's
  questions so we can answer more appropriately.
 
  Date: Mon, 7 Apr 2014 11:23:58 +0200
  From: lukas.naleze...@firma.seznam.cz
  To: user@giraph.apache.org
  Subject: Re: Can a vertex belong to more than one partition
 
 
  Hi,
 
  No, Vertex can belong only to one partition.
  Can you describe the algorithm you are solving? How many of those vertices
  belonging to all partitions do you have?
  Why do you need such strict partitioning?
 
  Regards
  Lukas
 
 
  On 6.4.2014 12:38, Akshay Trivedi wrote:
   In order to custom partition the graph, WorkerGraphPartitioner has to
   be implemented. It has a method getPartitionOwner(I vertexId) which
   returns the PartitionOwner of the vertex. I want some vertices to belong
   to all partitions, i.e., all PartitionOwners. Can anyone help me with this?
   Thank you in advance
  
   Regards
   Akshay
 
  

RE: Changing index of a graph

2014-04-16 Thread Pavan Kumar A
It totally depends on the input distribution. One very simple thing that can be 
done is: define a VertexResolver that upon every vertex creation sets its id = 
domain of the url & value = set of urls in the domain; it keeps appending as 
more vertices with the same id (i.e., domain) are read from input. [Now you can 
ignore edges altogether. All you are left with is these huge vertices that 
are identified by domains & contain value = set of urls.] Here you can use the 
aggregator approach of sending the (domain, count of set) to the master - these 
aggregators are then combined to give something like [(domain1, offset1), 
(domain2, offset2), etc.]; all vertices (the huge ones) read this aggregator and 
figure out their offset. Then when you output, just output the vertices in the set 
with number = offset + number in set.
So you have a map now - though it is highly unstable, because adding one more url 
to a domain later will change the order totally; that is when you can use id = 
domain + insert date, etc. [[Which will stop working at some point, because the 
aggregator needs to carry huge messages; then the computation of offsets via 
aggregators needs to be done in multiple supersteps, etc.]]
Anyway, now that you have the map of url - number, all you have to do is a join 
- that's simple: read your original table + this map table in a single Giraph 
job, and you can use 2 supersteps to rename all the vertices properly.

[[Note that you can do another thing here as well, MUCH SIMPLER THAN THE ABOVE]]
Superstep 0: in your compute class, have a thread-local variable that increases 
for each vertex the thread computes; assign the value [(workerid, threadid), 
number] to each vertex.
Now aggregate {(workerid, threadid), number}.
Superstep 1, master: now we have [{(workerid, threadid), count in group}], so 
recompute another aggregator which is like [{(workerid, threadid), cumulative 
sum up to now}]; send this aggregator to the workers.
Worker: read cumulative_sum from the aggregator and add it to each vertex's 
current value.
When you output the graph this time as edge output, sourceid and targetid are set 
as vertex values = the count
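The simpler scheme described above boils down to a prefix sum over per-(worker, thread) counts; a self-contained sketch of that arithmetic in plain Java (not the Giraph aggregator API -- group keys and numbering are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the two-superstep renumbering: each (worker, thread) group
// counts its vertices locally, the master turns the counts into
// cumulative starting offsets, and a vertex's global id becomes
// its group's offset plus its local number -- contiguous and hole-free.
public class ContiguousIds {
    // Master side: counts per group -> starting offset per group.
    public static Map<String, Long> offsets(Map<String, Long> counts) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }

    // Worker side: final id for a vertex numbered locally within its group.
    public static long globalId(Map<String, Long> offsets, String group, long localNumber) {
        return offsets.get(group) + localNumber;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("w0-t0", 3L);
        counts.put("w0-t1", 2L);
        counts.put("w1-t0", 4L);
        Map<String, Long> off = offsets(counts);
        // Group w1-t0 starts after the 5 vertices of the first two groups.
        System.out.println(globalId(off, "w1-t0", 0L)); // 5
    }
}
```

In Giraph the counts would travel to the master in an aggregator and the offsets would travel back the same way; the arithmetic is unchanged.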
Date: Tue, 15 Apr 2014 23:40:39 +0200
Subject: Re: Changing index of a graph
From: mneum...@spotify.com
To: user@giraph.apache.org

I have a pipeline that creates a graph then does some transformations on it 
(with Giraph). In the end I want to dump it into Neo4j to allow for cypher 
queries.  
I was told that I could make the batch import for Neo4j a lot faster if I used 
long identifiers without holes, thereby matching their internal ID space. If I 
understand it right, they use the IDs as offsets to build an on-disk index, 
which is why there should be no holes.

I didn't expect it to be so costly to change the index, but I guess this way I 
could at least spread the load across the cluster, since batch import happens on a 
single machine.

Thanks for the input; I will see what makes the most sense with the limited time 
I have.

On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec 
lukas.naleze...@firma.seznam.cz wrote:


  

  
  
Hi,

I did the same thing in two M/R jobs during preprocessing - it was pretty 
powerful for web graphs, but a little bit slow.

A solution for Giraph is:

1. Implement your own partition which will iterate vertices in order. Use an 
appropriate partitioner.

2. During the first iteration, rename vertices in each partition without holes. 
Holes will only be between partitions. At the end, get the min and max vertex 
index for each partition, send them to the master in an aggregator, and compute 
the mapping required to delete the holes.

3. During the second iteration, iterate all vertices and delete holes by 
shifting vertex indexes.

4. Rename edges (two more iterations)...

Btw: Why do you need such indexes? For HLL?

Lukas

  

  On 15.4.2014 15:33, Martin Neumann wrote:



  
Hej,

I have a huge edge list (several billion edges) where node IDs are URLs. The 
algorithm I want to run needs the IDs to be long, and there should be no holes 
in the ID space (so I can't simply hash the URLs).

Is anyone aware of a simple solution that does not require an impractically 
huge hash map?

My current idea is to load the graph into another Giraph job and then assign a 
number to each node; this way the mapping of number to URL would be stored in 
the node. The problem is that I have to assign the numbers sequentially to 
ensure there are no holes and the numbers are unique. No idea if this is even 
possible in Giraph.

Any input is welcome.

cheers Martin
  



  


  

RE: Giraph Buffer Size

2014-04-16 Thread Pavan Kumar A
Are you using Java 7? 

Date: Wed, 16 Apr 2014 13:07:20 +0530
Subject: Re: Giraph Buffer Size
From: agrta.ra...@gmail.com
To: user@giraph.apache.org

Hi Pavan,

For all the intermediate processing there would be a buffer (intermediate 
memory space) that stores data, messages, etc., before the process continues 
further.

Please correct me if I am wrong.

I have set Xms and Xmx values properly.

The problem is that the task runs for small datasets but as the input data size 
is increased, it fails.


The error that I am getting in sysout logs is-

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x2b404144, pid=10397, tid=1144650048

#
# JRE version: 6.0_25-b06
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J  sun.nio.ch.SelectorImpl.processDeregisterQueue()V
#
# An error report file with more information is saved as:

# /hadoopTaskTrackerLogsLocation/process_id/s_err_pid10397.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp



Please suggest what should be done? Am I missing anything?

Regards,
Agrta Rawat



On Wed, Apr 16, 2014 at 12:44 PM, Pavan Kumar A pava...@outlook.com wrote:





What do you mean by buffer size? Just as a note, please ensure that Xmx & Xms 
values are properly set for the mapper using mapred.child.java.opts or 
mapred.map.child.java.opts.
Also, what does the error message show? Please use pastebin & post the link here.
Date: Wed, 16 Apr 2014 12:13:29 +0530
Subject: Giraph Buffer Size
From: agrta.ra...@gmail.com

To: user@giraph.apache.org

Hi All,

I am trying to run a job in Giraph-1.0.0 on Hadoop-1.0.0 cluster with 3 nodes.

Each node has 32gb RAM.

In superstep-8 of my algorithm approximately 2M messages are being sent where 
size of each message is more than 20 kb. But the process sticks here and task 
gets failed.


In the sysout logs, it shows Fatal Error.

Is this error because Buffer is getting full?
How can I increase the buffer size for giraph application?

Please suggest.



Regards,
Agrta Rawat

  

  

RE: Optimal number of Workers

2014-04-16 Thread Pavan Kumar A
Giraph uses threads for compute, the netty server, and netty clients on workers, 
plus execution pools, input, output, etc. You can see most of these options in 
org.apache.giraph.conf.GiraphConstants, for instance:
  /** Netty client threads */
  IntConfOption NETTY_CLIENT_THREADS =
      new IntConfOption("giraph.nettyClientThreads", 4, "Netty client threads");
  /** Netty server threads */
  IntConfOption NETTY_SERVER_THREADS =
      new IntConfOption("giraph.nettyServerThreads", 16, "Netty server threads");
  /** Number of threads for vertex computation */
  IntConfOption NUM_COMPUTE_THREADS =
      new IntConfOption("giraph.numComputeThreads", 1, "Number of threads for vertex computation");
  /** Number of threads for input split loading */
  IntConfOption NUM_INPUT_THREADS =
      new IntConfOption("giraph.numInputThreads", 1, "Number of threads for input split loading");

The idea is that if you run your job on a cluster of 5 machines, typically 1 
machine is the master & 4 of them are workers which load the graph & compute 
on it. Each worker is a separate machine, and to maximize its utilization we can 
use as many threads as it can handle.
However, if you are running in pseudo-distributed mode then all workers run on the 
same machine & still try to launch the number of threads (defaults set in the 
config) - though each worker is now a thread (instead of a machine), it still 
launches all these other threads unscrupulously. Anyway, you can configure the 
threads spawned by workers to reduce the overall number of threads launched on 
your one machine.
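For instance, using the option names quoted above, the per-worker thread counts could be lowered on the job command line (the values here are illustrative, not recommendations):

```
# Reduce per-worker thread counts when many workers share one machine
-Dgiraph.nettyClientThreads=2
-Dgiraph.nettyServerThreads=4
-Dgiraph.numComputeThreads=1
-Dgiraph.numInputThreads=1
```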
From: chadijaber...@hotmail.com
To: user@giraph.apache.org
Subject: Optimal number of Workers
Date: Tue, 15 Apr 2014 13:34:53 +0200




Hello!
Can anybody explain how threads are used by workers in Giraph? For which 
purposes? How is the number of threads to use determined by a worker?
I often have the following error: org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: unable to create new native thread.
A check on the number of threads per worker shows child processes with 100 
threads per worker process (10 workers on a 12-processor machine), which is in 
my opinion too large, isn't it? If I reduce the number of workers, the number 
of threads decreases. How must we choose the number of workers?
Thanks in advance.
Chadi


  

RE: clustering coefficient (counting triangles) in giraph.

2014-03-17 Thread Pavan Kumar A
If what you need is 
http://en.wikipedia.org/wiki/Clustering_coefficient#Local_clustering_coefficient 
then I implemented it in Giraph; will submit a patch soon.

Date: Mon, 17 Mar 2014 15:33:07 -0400
Subject: Re: clustering coefficient (counting triangles) in giraph.
From: kaushikpatn...@gmail.com
To: user@giraph.apache.org

Check out this paper on implementing triangle counting in a BSP model by Prof 
David Bader from Georgia Tech.

http://www.cc.gatech.edu/~bader/papers/GraphBSPonXMT-MTAAP2013.pdf


I implemented a similar version in Apache Giraph, and it worked pretty well. 
You have to switch on the write to disk option though, as in the second and 
third cycle of the algorithm you have a massive message build up.



On Mon, Mar 17, 2014 at 3:17 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, Experts,
  Does anybody know if there are examples of implementation in giraph for 
clustering coefficient (counting triangles)? Thanks!


  Best Regards,
  Suijian




  

RE: Running one compute function after another..

2014-01-11 Thread Pavan Kumar A
Jyoti - I recently did a similar thing. In fact, my approach was exactly what 
Maja suggested. However, there is a caveat. You can switch the computation class 
for workers in MasterCompute's compute method, but that requires the messages 
sent by the computation class active before switching and the messages received 
by the computation class after switching to be the same.
For instance:
Superstep 1 - Compute-A (M1)
Superstep 2 - Compute-A (M1)
Superstep 3 - Compute-B (receives M1, outgoing is M2) -- you can achieve this 
using AbstractComputation instead of BasicComputation. However, if Compute-B 
needs to be used in superstep 4 as well, i.e.
Superstep 4 - Compute-B [it receives M2, but that conflicts with its definition]
So in this case the trick is:
Superstep 1 - Compute-A (M1)
Superstep 2 - Compute-A (M1)
-- time to switch --
Superstep 3 - NoOpMessageSink extends AbstractComputation<I,V,E,M1,M2> whose 
compute() = { translate M1 -> M2 }
-- make the switch --
Superstep 4 - Compute-B (M2)
Superstep 5 - Compute-B (M2)
and so on.
If your compute functions alternate, then you can extend AbstractComputation like:
Superstep 1 - Compute-A (extends AbstractComputation with M1 in, M2 out)
Superstep 2 - Compute-B (extends AbstractComputation with M2 in, M1 out)
Superstep 3 - Compute-A (extends AbstractComputation with M1 in, M2 out)
Superstep 4 - Compute-B (extends AbstractComputation with M2 in, M1 out)
@Maja, please add to / correct what I wrote.
Thanks.
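The message-type bookkeeping in the trick above can be sketched in a few lines of plain Java (M1, M2, and the translation are illustrative stand-ins, not Giraph types):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the switching trick: Compute-A sends messages of type M1,
// Compute-B expects type M2, so one intermediate "NoOpMessageSink"
// superstep translates M1 -> M2 before the computation class is switched.
public class SwitchSketch {
    static final class M1 {
        final int value;
        M1(int value) { this.value = value; }
    }
    static final class M2 {
        final long value;
        M2(long value) { this.value = value; }
    }

    // The bridging superstep: receive every M1, emit the equivalent M2.
    static List<M2> translate(List<M1> incoming) {
        List<M2> out = new ArrayList<>();
        for (M1 m : incoming) {
            out.add(new M2(m.value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<M1> fromComputeA = new ArrayList<>();
        fromComputeA.add(new M1(7));
        fromComputeA.add(new M1(9));
        System.out.println(translate(fromComputeA).size()); // 2
    }
}
```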
From: majakabi...@fb.com
To: user@giraph.apache.org
Subject: Re: Running one compute function after another..
Date: Sat, 11 Jan 2014 19:01:08 +






Hi Jyoti,



A cleaner way to do this is to switch the Computation class being used at the 
moment your condition is satisfied. So you can have an aggregator to check 
whether the condition is met, and then in your MasterCompute you call 
setComputation(SecondComputationClass.class) when needed.



Regards,
Maja





From: Jyoti Yadav rao.jyoti26ya...@gmail.com

Reply-To: user@giraph.apache.org user@giraph.apache.org

Date: Saturday, January 11, 2014 10:48 AM

To: user@giraph.apache.org user@giraph.apache.org

Subject: Re: Running one compute function after another..









Hi Ilias...


I will go by this..


Thanks...






On Sat, Jan 11, 2014 at 10:52 PM, Ilias 
ikapo...@csd.auth.gr wrote:


Hey,



You can have a boolean variable initially set to true (or false, whatever). Then 
you divide your code based on the value of that variable with an if-else 
statement. For my example, if the value is true then it goes through the first 
'if'. When the condition you want is fulfilled, change the value of the variable 
to false (at all nodes) and then the second part will be executed.



Ilias



On 11/1/2014 6:18 PM, Jyoti Yadav wrote:




Hi folks..





In my algorithm, all vertices execute one compute function up to a certain 
condition; when that condition is fulfilled, I want all vertices to then 
execute another compute function. Is this possible?



Any ideas are highly appreciated..



Thanks

Jyoti













  

RE: Issues with Giraph v1.0.0

2013-12-13 Thread Pavan Kumar A
Hi Pankaj,
Note that in Giraph the vertex is the first-class citizen, while edges are just 
data associated with a vertex. So when you delete a vertex, you delete all data 
associated with it, i.e., its outgoing edges, its value, its id, etc.
However, it is not trivial to delete all incoming edges of a vertex, since a 
vertex is not directly aware of the existence of such edges. Such edges can only 
be deleted at the source vertex.
Note that based on the type of out-edges you use, Giraph can support a multigraph 
or a simple graph [both being directed, of course]. So if you want to avoid 
duplicated edges you can set the OutEdges type to LongDoubleHashMapEdges, for 
example. Issues 2 & 3 are related.
You can overcome all of these at the application layer by dedicating the first 
few supersteps to cleaning up your graph.
Thanks.

From: pankajiit...@gmail.com
Date: Fri, 13 Dec 2013 13:25:49 +0530
Subject: Issues with Giraph v1.0.0
To: user@giraph.apache.org
CC: agrta.ra...@gmail.com

Hi,
I am facing following issues while using giraph-1.0.0-for-hadoop-1.0.2.
1. Deletion of a vertex does not delete the incoming edges of that vertex.
2.
 When removeVertexRequest(sourceId,targetId) method is used inside a for-loop, 
i.e., multiple delete requests are sent by a vertex, it deletes only the last 
edge identified by the for-loop.
3. Edge creation does not check for duplicate edges.



Any pointers or solution will be helpful.
Thanks in advance!
Regards,Pankaj

RE: vertex and data block co-location

2013-12-08 Thread Pavan Kumar A
@DavidYou can have a look at 
http://researcher.watson.ibm.com/researcher/files/us-ytian/giraph++.pdfThis 
work was done by 
http://researcher.watson.ibm.com/researcher/view.php?person=us-ytianIn this she 
talks about alternative partitioning schemes she implemented on top of giraph 
and the showsthe resulting optimizations taking some graph algorithms as 
examples.
Date: Sun, 8 Dec 2013 10:55:51 -0800
Subject: Re: vertex and data block co-location
From: apache.mail...@gmail.com
To: user@giraph.apache.org

Running Giraph on MapReduce, you have no control over where the worker tasks 
will be hosted on the cluster. Therefore the partitioning generally is not 
aware of co-located blocks and does a fair amount of time-consuming network 
shuffling of data during the initialization of a Giraph job.

What Giraph does do is, as each worker task spins up on the cluster, it 
attempts to claim input splits that happen to be local to the DataNode the 
worker runs on. This speeds up the initial ingestion of graph data quite a bit, 
but does not help much when it comes to distributing the data to the worker 
that owns that data's assigned partition.

Only when all data has been pushed to the appropriate worker can the 
Giraph job actually begin. When data does end up belonging to a host-local 
partition it is not sent over the network, but in many cases there is no 
alternative without using an alternative to hash partitioning.


On Sat, Nov 16, 2013 at 12:22 PM, David J Garcia djch...@utexas.edu wrote:

hello, I was wondering if there was a way to ensure that vertices located on 
the same data block (on hdfs) are co-located with each other?


Also, will the vertices in input-splits (splits that are located on the same 
DataNode) have a reasonable chance of being partitioned to the same id?


for example, suppose that I have vertex_1 located on data_block_i, and vertex_2 
located on data_block_k. Let's suppose that both of the data blocks are 
located on the same DataNode machine. Is there a reasonably good chance that 
vertex_1 and vertex_2 will be partitioned to the same id?



I'm doing a research project and I'm trying to show the benefits of graph 
data-locality.


-David