RE: Best way to know the assignment of vertices to workers
I wrote a diff some time ago with which you can easily do that. You can find implementation details at:
https://issues.apache.org/jira/browse/GIRAPH-908
https://reviews.apache.org/r/22234/
Some options you can use are:
# Mapping store related information
-Dgiraph.mappingStoreClass=org.apache.giraph.mapping.LongByteMappingStore
-Dgiraph.lbMappingStoreUpper=1987000
-Dgiraph.lbMappingStoreLower=4096
# Mapping store ops information
-Dgiraph.mappingStoreOpsClass=org.apache.giraph.mapping.DefaultEmbeddedLongByteOps
# Embed mapping information
-Dgiraph.edgeTranslationClass=org.apache.giraph.mapping.translate.LongByteTranslateEdge
# PartitionerFactory to be used
-Dgiraph.graphPartitionerFactoryClass=org.apache.giraph.partition.LongMappingStorePartitionerFactory
And like vertex input and edge input, we now have a mapping input. I only implemented all these for giraph-hive, so if you have a hive table with the mapping vertexId -> workerNum, then you can pass the mapping input like org.apache.giraph.hive.input.mapping.examples.LongInt2ByteHiveToMapping, $mapping_table, $mapping_partition. You can go through the code for each of these options to see what they do. Using this you can, in a sense, pre-assign workers to vertex ids: if you assign two vertices to a worker, say worker-1, it is guaranteed they are both present in the same worker. The numbering (aka identification/naming) of workers is consistent (i.e., if a, b are assigned worker-x, they are guaranteed to be in the same worker, but we do not know ahead of time which physical worker that will be), but it cannot be explicitly set by the user (which is what you want to do, from what I can tell). If you are using something other than hive, then you will have to implement all the interfaces of MappingInputFormat, and then you can easily achieve what you want.
From: kiran.garime...@aalto.fi To: user@giraph.apache.org Subject: Best way to know the assignment of vertices to workers Date: Fri, 28 Nov 2014 12:02:59 +
Hi all, Is there a clean way to find out which worker a particular vertex is assigned to? From what I tried out, I found that given n workers, each node is assigned to the worker with id (vertex_id % n). Is that a safe way to do this? I've had a look at previous discussions, but most of them have no answer.
— Why I need it: In my application, each vertex needs to know some additional metadata, which is loaded from a file. This metadata file is huge (50 GB), so on each worker I only want to load the metadata corresponding to the vertices present on that worker.
— Previous discussions:
1. http://mail-archives.apache.org/mod_mbox/giraph-user/201310.mbox/%3C7EC16F82718A6D4A920A99FE46CE7F4E2861F779%40MERCMBX19R.na.SAS.com%3E
2. http://mail-archives.apache.org/mod_mbox/giraph-user/201403.mbox/%3CCAMf08QYE%2BRgUv9otXT6oPJorTNjQ-Ay8p4NUiuhds8%2BzgDzs1w%40mail.gmail.com%3E
Regards, Kiran
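For completeness, the same options can also be set programmatically when building the job instead of via -D flags. A minimal sketch, assuming the GIRAPH-908 classes listed above are on the classpath (the surrounding job setup is omitted):

import org.apache.giraph.conf.GiraphConfiguration;

public class MappingJobSetup {
  public static GiraphConfiguration configure() {
    GiraphConfiguration conf = new GiraphConfiguration();
    // Same keys as the -D options quoted in the thread above.
    conf.set("giraph.mappingStoreClass", "org.apache.giraph.mapping.LongByteMappingStore");
    conf.set("giraph.lbMappingStoreUpper", "1987000");
    conf.set("giraph.lbMappingStoreLower", "4096");
    conf.set("giraph.mappingStoreOpsClass", "org.apache.giraph.mapping.DefaultEmbeddedLongByteOps");
    conf.set("giraph.edgeTranslationClass", "org.apache.giraph.mapping.translate.LongByteTranslateEdge");
    conf.set("giraph.graphPartitionerFactoryClass", "org.apache.giraph.partition.LongMappingStorePartitionerFactory");
    return conf;
  }
}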
RE: Best way to know the assignment of vertices to workers
I looked at the code again; it does not seem like workerList is sorted, etc., so knowing a worker number gives no consistent way to tell the actual worker details each time. Lukas was working on such a diff some time back. Perhaps he can answer more.
From: pava...@outlook.com To: user@giraph.apache.org Subject: RE: Best way to know the assignment of vertices to workers Date: Sat, 29 Nov 2014 11:23:39 +0530 ...
RE: Graph partitioning and data locality
You can also look at https://issues.apache.org/jira/browse/GIRAPH-908, which solves the case where you have a partition map and would like the graph to be partitioned that way after loading the input. It does not, however, solve the do-not-shuffle-data part.
From: claudio.marte...@gmail.com Date: Tue, 4 Nov 2014 16:20:21 +0100 Subject: Re: Graph partitioning and data locality To: user@giraph.apache.org
Hi, answers are inline.
On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns martin.jungha...@gmx.net wrote:
Hi group, I have a question concerning the graph partitioning step. If I understood the code correctly, the graph is distributed to n partitions by using vertexID.hashCode() % n. I have two questions concerning that step.
1) Is the whole graph loaded and partitioned only by the Master? This would mean the whole data has to be moved to that Master map job and then moved to the physical node the specific worker for the partition runs on. As this sounds like a huge overhead, I further inspected the code: I saw that there is also a WorkerGraphPartitioner, and I assume it calls the partitioning method on its local data (let's say its local HDFS blocks), and if the resulting partition for a vertex is not itself, the data gets moved to that worker, which reduces the overhead. Is this assumption correct?
That is correct; workers forward vertex data to the correct worker that is responsible for that vertex via hash partitioning (by default), meaning that the master is not involved.
2) Let's say the graph is already partitioned in the file system, e.g. blocks on physical nodes contain logically connected graph nodes. Is it possible to just read the data as it is and skip the partitioning step? In that case I currently assume that the vertexID should contain the partitionID and the custom partitioning would be an identity function (instead of hashing or range).
In principle you can. You would need to organize splits so that they contain all the data for each particular worker, and then assign relevant splits to the corresponding worker.
Thanks for your time and help! Cheers, Martin
-- Claudio Martella
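For reference, the default assignment described in this thread can be sketched as plain Java; this is a simplified mirror of Giraph's hash partitioning, and the helper names plus the round-robin partition-to-worker step are illustrative assumptions, not the exact library code:

import org.apache.hadoop.io.LongWritable;

public class HashPlacementSketch {
  // Default-style partition choice: hash the vertex id over the partition count.
  static int partitionOf(LongWritable id, int partitionCount) {
    return Math.abs(id.hashCode() % partitionCount);
  }

  // Assumed round-robin mapping of partitions onto workers.
  static int workerOf(int partition, int workerCount) {
    return partition % workerCount;
  }

  public static void main(String[] args) {
    LongWritable v = new LongWritable(42L);
    int partition = partitionOf(v, 128);
    System.out.println("vertex 42 -> partition " + partition
        + " -> worker " + workerOf(partition, 4));
  }
}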
RE: Using a custom graph partitioning strategy with giraph
I will write a detailed explanation over the weekend. Thanks for your interest.
Date: Wed, 1 Oct 2014 10:56:16 -0700 Subject: Re: Using a custom graph partitioning strategy with giraph From: charith.dhanus...@gmail.com To: user@giraph.apache.org
Thanks Pavan, I get the high-level idea. I am still new to the Giraph code base, so I am still trying to understand the overall design, and I have a few questions regarding this feature. Can we use this feature with a vertex input format without using edge translation? (Since getPartition in MappingStoreOps can be used to get the partition of any target vertex.) Also, since I have the mapping information in a separate file, do I need to embed target information in the vertex? It would be great if you could explain your scenario with the data format you used and which extension points you used, so that I can understand it better and adapt it to my scenario. Thanks, Charith
On Mon, Sep 29, 2014 at 3:34 PM, Pavan Kumar A pava...@outlook.com wrote:
We have two inputs - vertex and edges. If we partition edges and vertices based on a map, then when we want to send messages we should be able to know which partition a vertex is on. Typically we send messages to the targetIds of outgoing edges; edge translation helps encode the mapping information into targetIds, so knowing which partition to send a message to can be done by just looking at the targetId.
Date: Mon, 29 Sep 2014 14:37:22 -0700 Subject: Re: Using a custom graph partitioning strategy with giraph From: charith.dhanus...@gmail.com To: user@giraph.apache.org
Hi Pavan, Thanks for the details. I went through the code, especially the extension points you mentioned. I am not clear about the function of the edge translation (org.apache.giraph.mapping.translate.TranslateEdge) class. Could you please explain the idea of this translation process? In my case I will have a mapping file which maps each vertex to a partition, e.g.:
v1 part1
v2 part2
v3 part3
...
So I was thinking of passing this as a parameter and reading it inside my own MappingStore implementation: -Dgiraph.mappingFilePath=/user/charith/input/mapping.txt Is there a better approach? Thanks, Charith
On Sun, Sep 28, 2014 at 8:29 AM, Pavan Kumar A pava...@outlook.com wrote:
I worked on this feature some time back - but I only worked on inputting a hive file, not hdfs. You can use logic outside giraph to select which partition file to use - this is possible because you input the number of workers anyway. For instance, in the script that you use to launch a giraph job, have selection logic for the partition file. You can take a look at: https://issues.apache.org/jira/browse/GIRAPH-908 You might have to extend upon the jira for your specific use case - I only added support for the case where id = LongWritable. Here is a list of options you might want to explore:
# Mapping store related information
-Dgiraph.mappingStoreClass=org.apache.giraph.mapping.LongByteMappingStore
-Dgiraph.lbMappingStoreUpper=1987000
-Dgiraph.lbMappingStoreLower=4096
# Mapping store ops information
-Dgiraph.mappingStoreOpsClass=org.apache.giraph.mapping.DefaultEmbeddedLongByteOps
# Embed mapping information
-Dgiraph.edgeTranslationClass=org.apache.giraph.mapping.translate.LongByteTranslateEdge
# PartitionerFactory to be used
-Dgiraph.graphPartitionerFactoryClass=org.apache.giraph.partition.LongMappingStorePartitionerFactory
So the partition map is stored here as a map of byte arrays, with lbMappingStoreUpper being the size of the map and lbMappingStoreLower being the size of the individual arrays. Please explore the code and tell me what else you need. Thanks
Date: Sat, 27 Sep 2014 22:51:29 -0700 Subject: Re: Using a custom graph partitioning strategy with giraph From: charith.dhanus...@gmail.com To: user@giraph.apache.org
Also adding some more information. My current understanding is that I should be able to do this with my own org.apache.giraph.partition.WorkerGraphPartitioner implementation. But my question is: is there a way to get some outside input inside the WorkerGraphPartitioner? In my case it will be an hdfs file location. Thanks, Charith
On Sat, Sep 27, 2014 at 10:13 PM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote:
Hi, I'm trying to use giraph with a custom graph partitioner that I have. In my case I want to assign vertices to workers based on a custom partitioner input. The partitioner takes the number of workers as an input parameter and gives me a file which maps each vertex id to a worker. I'm trying to load this file to an hdfs location and use it as an input to giraph to do the vertex assignment. Any suggestions or pointers on the best way to do this will be highly appreciated (using the current extension points of giraph as much as possible to avoid random hacks). I'm currently using giraph-1.0.0. Thanks, Charith
-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith
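As a rough illustration of the mapping-file idea discussed in this thread, here is a hedged sketch of loading a "vertexId workerNum" text file from HDFS into an in-memory map. giraph.mappingFilePath is the custom option Charith proposes above; the helper class itself is hypothetical and not part of GIRAPH-908:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MappingFileLoader {
  /** Reads lines of the form "vertexId workerNum" into a map. */
  public static Map<Long, Byte> load(Configuration conf) throws Exception {
    Path path = new Path(conf.get("giraph.mappingFilePath"));
    Map<Long, Byte> mapping = new HashMap<>();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(FileSystem.get(conf).open(path)))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.trim().split("\\s+");
        mapping.put(Long.parseLong(parts[0]), Byte.parseByte(parts[1]));
      }
    }
    return mapping;
  }
}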
RE: Graph re-partitioning
If you are using hash partitioning, then as long as the number of workers is the same, the partitions will remain unchanged, though they might run on a different worker. However, yes, the graph is always partitioned again on each run.
Date: Mon, 29 Sep 2014 15:01:37 -0400 Subject: Graph re-partitioning From: xuhongne...@gmail.com To: user@giraph.apache.org
Hello, Will Giraph re-partition the graph each time a job is run on the graph? Is there any way to directly load the partitioned graph from the last job? Thanks -- Xuhong Zhang
RE: Using a custom graph partitioning strategy with giraph
We have two inputs - vertex and edges. If we partition edges and vertices based on a map, then when we want to send messages we should be able to know which partition a vertex is on. Typically we send messages to the targetIds of outgoing edges; edge translation helps encode the mapping information into targetIds, so knowing which partition to send a message to can be done by just looking at the targetId.
Date: Mon, 29 Sep 2014 14:37:22 -0700 Subject: Re: Using a custom graph partitioning strategy with giraph From: charith.dhanus...@gmail.com To: user@giraph.apache.org ...
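To make the translation idea concrete, here is a hedged sketch of what embedding mapping information in a target id can look like. The bit layout and the helper are illustrative assumptions for this thread, not the actual LongByteTranslateEdge code:

public class EmbedWorkerInIdSketch {
  // Pack an assumed 8-bit worker number into the top byte of a 64-bit vertex id.
  static long embed(long rawId, byte worker) {
    return (rawId & 0x00FFFFFFFFFFFFFFL) | ((long) (worker & 0xFF) << 56);
  }

  // Recover the worker number from a translated target id.
  static byte workerOf(long embeddedId) {
    return (byte) (embeddedId >>> 56);
  }

  public static void main(String[] args) {
    long id = embed(12345L, (byte) 3);
    System.out.println("worker = " + workerOf(id)); // prints 3
  }
}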
RE: Using a custom graph partitioning strategy with giraph
I worked on this feature some time back - but I only worked on inputting a hive file, not hdfs. You can use logic outside giraph to select which partition file to use - this is possible because you input the number of workers anyway. For instance, in the script that you use to launch a giraph job, have selection logic for the partition file. You can take a look at: https://issues.apache.org/jira/browse/GIRAPH-908 You might have to extend upon the jira for your specific use case - I only added support for the case where id = LongWritable. Here is a list of options you might want to explore:
# Mapping store related information
-Dgiraph.mappingStoreClass=org.apache.giraph.mapping.LongByteMappingStore
-Dgiraph.lbMappingStoreUpper=1987000
-Dgiraph.lbMappingStoreLower=4096
# Mapping store ops information
-Dgiraph.mappingStoreOpsClass=org.apache.giraph.mapping.DefaultEmbeddedLongByteOps
# Embed mapping information
-Dgiraph.edgeTranslationClass=org.apache.giraph.mapping.translate.LongByteTranslateEdge
# PartitionerFactory to be used
-Dgiraph.graphPartitionerFactoryClass=org.apache.giraph.partition.LongMappingStorePartitionerFactory
So the partition map is stored here as a map of byte arrays, with lbMappingStoreUpper being the size of the map and lbMappingStoreLower being the size of the individual arrays. Please explore the code and tell me what else you need. Thanks
Date: Sat, 27 Sep 2014 22:51:29 -0700 Subject: Re: Using a custom graph partitioning strategy with giraph From: charith.dhanus...@gmail.com To: user@giraph.apache.org ...
RE: receiving messages that I didn't send
Can you give more context? What are the types of messages, the code of your compute method, etc.? You will not receive messages that are not sent, but one thing that can happen is this: a message can have multiple fields. Suppose message objects have 2 fields, m = (a, b), and say in m's write(out) you do not handle the case of b = null. m1 sets b; m2 has b = null. Then, because of the incorrect code in m's write(), m2 can show b = m1.b. That is because message objects are re-used when receiving. This is a Giraph gotcha, because of object reuse in most iterators. Thanks
From: m...@matthewcornell.org Date: Tue, 23 Sep 2014 10:10:48 -0400 Subject: receiving messages that I didn't send To: user@giraph.apache.org
Hi Folks. I am refactoring my compute() to use a set of ids as its message type, and in my tests it is receiving a message that it absolutely did not send. I've debugged it and am at a loss. Interestingly, I encountered this once before and solved it by creating a copy of a Writable instead of re-using it, but I haven't been able to solve it this time. In general, does this anomalous behavior indicate a Giraph/Hadoop gotcha? It's really confounding! Thanks very much -- matt
-- Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson Street, Amherst MA 01002 | matthewcornell.org
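A hedged sketch of the failure mode described above; the message class is hypothetical, but the bug pattern (a write()/readFields() pair that skips a null field, combined with Giraph reusing message objects during deserialization) is the one the reply warns about. The version below shows the fix, with the bug described in the comments:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PairMessage implements Writable {
  private long a;
  private String b; // may be null

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(a);
    // BUG to avoid: silently skipping b when it is null. On the receiving
    // side, readFields() runs on a *reused* object, so the previous
    // message's b would survive and appear as a "ghost" value.
    // Fix: always record presence explicitly.
    out.writeBoolean(b != null);
    if (b != null) {
      out.writeUTF(b);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    a = in.readLong();
    b = in.readBoolean() ? in.readUTF() : null; // reset b even when absent
  }
}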
RE: NegativeArraySizeException with large dataset
Yes, you should implement your own edge store. Please take a look at ByteArrayEdges for an example, and modify it to use BigDataOutput and BigDataInput instead of ExtendedByteArrayOutput/Input.
From: and...@wizardapps.net To: user@giraph.apache.org Subject: Re: NegativeArraySizeException with large dataset Date: Tue, 9 Sep 2014 09:58:36 -0700
Great, thanks for pointing me in the right direction. All of the edge values are strings (in a Text object) and point to and from vertices with Text IDs, but none of the values should be greater than 60 bytes or so during the loading step. The size will increase during computation because I am modifying the values of the edges, but the actual size of the data is not too large. Given that I am using text-based IDs and values, it looks to me like I may have to implement my own edge store -- does that seem right? Thank you for your help! -- Andrew
On Mon, Sep 8, 2014, at 05:31 PM, Pavan Kumar A wrote:
ByteArrayEdges and all of the other edge stores use array-based / map-based stores; all of these will encounter this exception when the size of the array approaches Integer.MAX_VALUE. Some things to consider for the time being: what do your edges look like? If they are long ids and null values, you can use LongNullArrayEdges to push the boundary a bit, i.e., until you get a vertex that has ~2 billion outgoing edges. For long ids and double values you can use LongDoubleArrayEdges, etc. Please take a look at the classes that implement the OutEdges interface. If none of those work, you can implement one of your own and use a store backed by data structures like BigDataOutput instead of plain old byte arrays.
From: and...@wizardapps.net To: user@giraph.apache.org Subject: NegativeArraySizeException with large dataset Date: Mon, 8 Sep 2014 17:19:17 -0700
Hey, I am currently running Giraph on a semi-large dataset of 600 million edges (the edges are directed, so I've used the ReverseEdgeDuplicator for an expected total of 1.2b edges). I am running into an issue during superstep -1 when the edges are being loaded -- I receive a java.lang.NegativeArraySizeException. This occurs near the end of when the edges should be done loading -- by my estimate, I believe around 1b out of the 1.2b have been loaded. The exception occurs on one of the workers, and all of the other workers subsequently halt loading before I kill the job. The issue doesn't occur with half of the dataset (300 million edges, 600 million total with the reverser). The only reference I've found to this particular exception type is GIRAPH-821 (https://issues.apache.org/jira/browse/GIRAPH-821), which suggests enabling the useBigDataIOForMessages flag. I would be surprised if it helped, because this error occurs during the loading superstep, and there are no super vertices in my traversal computation. Enabling this flag had no effect. Any help on this would be appreciated.
The full stack trace for the exception is as follows:
java.lang.NegativeArraySizeException
	at org.apache.giraph.utils.UnsafeByteArrayOutputStream.ensureSize(UnsafeByteArrayOutputStream.java:116)
	at org.apache.giraph.utils.UnsafeByteArrayOutputStream.write(UnsafeByteArrayOutputStream.java:167)
	at org.apache.hadoop.io.Text.write(Text.java:282)
	at org.apache.giraph.utils.WritableUtils.writeEdge(WritableUtils.java:501)
	at org.apache.giraph.edge.ByteArrayEdges.add(ByteArrayEdges.java:93)
	at org.apache.giraph.edge.AbstractEdgeStore.addPartitionEdges(AbstractEdgeStore.java:166)
	at org.apache.giraph.comm.requests.SendWorkerEdgesRequest.doRequest(SendWorkerEdgesRequest.java:72)
	at org.apache.giraph.comm.netty.handler.WorkerRequestServerHandler.processRequest(WorkerRequestServerHandler.java:62)
	at org.apache.giraph.comm.netty.handler.WorkerRequestServerHandler.processRequest(WorkerRequestServerHandler.java:36)
	at org.apache.giraph.comm.netty.handler.RequestServerHandler.channelRead(RequestServerHandler.java:108)
	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
	at org.apache.giraph.comm.netty.handler.RequestDecoder.channelRead(RequestDecoder.java:100)
	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
	at io.netty.channel.DefaultChannelHandlerContext.access$700(DefaultChannelHandlerContext.java:29)
	at io.netty.channel.DefaultChannelHandlerContext$8.run(DefaultChannelHandlerContext.java:329)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:354)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
	at java.lang.Thread.run(Thread.java:745)
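The root cause is a single byte array whose size doubles past Integer.MAX_VALUE and overflows to a negative length. A hedged, self-contained sketch of the chunked-storage idea behind stores like BigDataOutput (the class below is purely illustrative, not Giraph's implementation):

import java.util.ArrayList;
import java.util.List;

public class ChunkedBytes {
  private static final int CHUNK = 1 << 20; // fixed 1 MiB chunks, no doubling
  private final List<byte[]> chunks = new ArrayList<>();
  private long size;

  public void write(byte b) {
    int offsetInChunk = (int) (size % CHUNK);
    if (offsetInChunk == 0) {
      chunks.add(new byte[CHUNK]); // grow by whole chunks; total size can exceed 2^31
    }
    chunks.get(chunks.size() - 1)[offsetInChunk] = b;
    size++;
  }

  public long size() {
    return size;
  }
}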
RE: NegativeArraySizeException with large dataset
ByteArrayEdges and all of the other edge stores use array-based / map-based stores; all of these will encounter this exception when the size of the array approaches Integer.MAX_VALUE. Some things to consider for the time being: what do your edges look like? If they are long ids and null values, you can use LongNullArrayEdges to push the boundary a bit, i.e., until you get a vertex that has ~2 billion outgoing edges. For long ids and double values you can use LongDoubleArrayEdges, etc. Please take a look at the classes that implement the OutEdges interface. If none of those work, you can implement one of your own and use a store backed by data structures like BigDataOutput instead of plain old byte arrays.
From: and...@wizardapps.net To: user@giraph.apache.org Subject: NegativeArraySizeException with large dataset Date: Mon, 8 Sep 2014 17:19:17 -0700 ...
RE: n-ary relationship on Giraph
Can you please provide more context? vertex - edge (the edge value can store any properties required of that edge) - vertex (the vertex value can store any property required for the vertex).
Date: Wed, 21 May 2014 13:50:34 -0700 From: sujanu...@yahoo.com Subject: n-ary relationship on Giraph To: user@giraph.apache.org
Hi, Does Giraph support n-ary relationships? I need to store some properties of the triplet vertex - edge - vertex and be able to query with those properties. Sujan Perera
RE: n-ary relationship on Giraph
The state of a triplet A -C-> B can be stored in the edge value for C (the edge from A to B). I would like to remind you that Giraph is a batch processing framework, and not a graph database. You can do complex graph processing on the input graph, and such questions can be answered very trivially, but the performance need not be great. You must write java code and run a map-reduce job. For this case your compute function consists of just 1 superstep, which filters edges for a vertex based on the criterion; then you can write the output back to one of the supported storage formats.
Date: Wed, 21 May 2014 16:32:44 -0700 From: sujanu...@yahoo.com Subject: Re: n-ary relationship on Giraph To: user@giraph.apache.org
Let's say I have nodes A and B, linked with edge C. Now I have properties which belong to this A - C - B triplet. For example, I have the property 'date created', which belongs to A - C - B. Can I represent this in Giraph? Also, does giraph have a querying mechanism, so that I can retrieve triplets which were created before a particular date? Sujan Perera
On Wednesday, May 21, 2014 3:51 PM, Pavan Kumar A pava...@outlook.com wrote: ...
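A hedged sketch of the one-superstep filtering the reply describes. The edge-value class, cutoff date, and surrounding types are illustrative assumptions, not code from the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;

/** Edge value holding the 'date created' of the A -C-> B triplet. */
class TripletValue implements Writable {
  long createdMillis;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(createdMillis);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    createdMillis = in.readLong();
  }
}

/** One-superstep filter: keep only triplets created before an assumed cutoff. */
public class DropNewTriplets
    extends BasicComputation<LongWritable, NullWritable, TripletValue, NullWritable> {
  private static final long CUTOFF_MILLIS = 1400630400000L; // placeholder query date

  @Override
  public void compute(Vertex<LongWritable, NullWritable, TripletValue> vertex,
      Iterable<NullWritable> messages) throws IOException {
    List<LongWritable> toDrop = new ArrayList<>();
    for (Edge<LongWritable, TripletValue> edge : vertex.getEdges()) {
      if (edge.getValue().createdMillis >= CUTOFF_MILLIS) {
        // Copy the id: edge objects may be reused by the iterator.
        toDrop.add(new LongWritable(edge.getTargetVertexId().get()));
      }
    }
    for (LongWritable target : toDrop) {
      vertex.removeEdges(target);
    }
    vertex.voteToHalt(); // done after one superstep; then write the output
  }
}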
RE: input superstep of giraph.
Btw, the jobs we run typically run for hours, so total time is mostly just the sum of input + supersteps for us, since the little extra time is negligible. However, I see your job itself is so small; to be more accurate, total time = the time between the start of your job (once all machines were allocated) and the end of your job (when you can see in the logs that all workers are done with the last superstep).
From: pava...@outlook.com To: user@giraph.apache.org Subject: RE: input superstep of giraph. Date: Fri, 18 Apr 2014 23:58:53 +0530
Please take a look at GIRAPH-838. Note that there is a little window between the end of one superstep and the start of the next, so this 120 s can be accounted for by that. What I meant was that total time is as good as the sum of input + other supersteps (though only approximately, because of this slight extra time).
Date: Fri, 18 Apr 2014 16:28:48 +0100 Subject: Re: input superstep of giraph. From: ghufran1ma...@gmail.com To: user@giraph.apache.org
Hi Pavan, I might have misunderstood your explanation, but from the giraph timers I received, Total does not seem to be the sum of input time + the sum of time in all supersteps. For example, the following timers were output after I ran the ConnectedComponents algorithm:
Giraph Timers
Initialize (ms)=424
Input superstep (ms)=1457
Setup (ms)=85
Shutdown (ms)=11666
Superstep 0 ConnectedComponentsComputation (ms)=903
Superstep 1 ConnectedComponentsComputation (ms)=4565
Superstep 10 ConnectedComponentsComputation (ms)=475
Superstep 11 ConnectedComponentsComputation (ms)=454
Superstep 12 ConnectedComponentsComputation (ms)=342
Superstep 2 ConnectedComponentsComputation (ms)=3094
Superstep 3 ConnectedComponentsComputation (ms)=1399
Superstep 4 ConnectedComponentsComputation (ms)=783
Superstep 5 ConnectedComponentsComputation (ms)=591
Superstep 6 ConnectedComponentsComputation (ms)=458
Superstep 7 ConnectedComponentsComputation (ms)=458
Superstep 8 ConnectedComponentsComputation (ms)=483
Superstep 9 ConnectedComponentsComputation (ms)=458
Total (ms)=27675
Input superstep = sum of input time? Input superstep + sum of supersteps = 1457 + 14463 = 15920, and the total is 27675, so there is still 11755 ms unaccounted for? Or have I misunderstood what the sum of input time should be? Kind regards, Ghufran
On Fri, Apr 18, 2014 at 3:52 PM, ghufran malik ghufran1ma...@gmail.com wrote:
Hi, Thank you for the explanation :) It was confusing when reading it; some of the timers I can intuitively understand, but I think it would be beneficial if these explanations were added to the API docs, so that if anyone else is confused they can look up the meanings there. https://giraph.apache.org/giraph-core/apidocs/org/apache/giraph/counters/GiraphTimers.html Thanks, Ghufran
On Fri, Apr 18, 2014 at 3:25 PM, Pavan Kumar A pava...@outlook.com wrote:
I wrote the Initialize counter :) Please tell me if the name seems confusing. So:
Initialize = the time spent by the job waiting for resources. In a shared pool the job you launch may not get all the machines needed to start the job. So, for instance, if you want to run a job with 200 workers, giraph does not start until all the workers have been allocated and register with the master.
Setup = once you have all the machines allocated, how much time it takes before starting to read input.
Shutdown = once you have written your output, how much time it takes to stop, verify that everything is done, shut down resources, and notify the user - for instance, wait for all network connections to close, all threads to join, etc.
Total = sum of input time + sum of time in all supersteps, i.e., the actual time taken to run by your application after it got all the resources (it does not include the time waiting to get resources, which is initialize, or shutdown time).
Date: Fri, 18 Apr 2014 13:28:47 +0100 Subject: Re: input superstep of giraph. From: ghufran1ma...@gmail.com To: user@giraph.apache.org
Hi, Could you also explain what the following timers correspond to as well, please:
Giraph Timers
Initialize (ms)=775
Setup (ms)=105
Shutdown (ms)=12537
Total (ms)=27075
Thanks, Ghufran
On Thu, Apr 17, 2014 at 9:10 PM, Pavan Kumar A pava...@outlook.com wrote: ...
RE: input superstep of giraph.
Input consists of:
- reading the input (vertices and/or edges as provided) into memory on individual workers
- assigning vertices to partitions and partitions to workers
- moving all partitions (i.e., vertices and their out-edges) to the worker which owns the partition
- doing some bookkeeping of internal data structures to be used during computation
Date: Thu, 17 Apr 2014 10:06:03 -0500 Subject: input superstep of giraph. From: suijian.z...@gmail.com To: user@giraph.apache.org
Hi, From the screen output of a successful giraph program run, what does the following line mean? Input superstep (ms)=22884 Does it mean the time used to load the input graph into memory? Thanks. Best Regards, Suijian
RE: Giraph Buffer Size
What do you mean by buffer size? Just as a note, please ensure that the Xmx and Xms values are properly set for the mapper using mapred.child.java.opts or mapred.map.child.java.opts. Also, what does the error message show? Please use pastebin and post the link here.
Date: Wed, 16 Apr 2014 12:13:29 +0530 Subject: Giraph Buffer Size From: agrta.ra...@gmail.com To: user@giraph.apache.org
Hi All, I am trying to run a job in Giraph-1.0.0 on a Hadoop-1.0.0 cluster with 3 nodes. Each node has 32gb RAM. In superstep-8 of my algorithm approximately 2M messages are being sent, where the size of each message is more than 20 kb. But the process gets stuck here and the task fails. In the sysout logs, it shows Fatal Error. Is this error because a buffer is getting full? How can I increase the buffer size for a giraph application? Please suggest. Regards, Agrta Rawat
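For reference, a minimal sketch of passing those JVM options on the command line when launching a Giraph job; the jar name, computation class, worker count, and heap sizes are placeholders, not values from this thread:

hadoop jar my-giraph-job.jar org.apache.giraph.GiraphRunner \
  -Dmapred.child.java.opts="-Xms4g -Xmx4g" \
  com.example.MyComputation -w 3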
RE: Can a vertex belong to more than one partition
Isn't graph isomorphism NP-hard in general? I guess you already know that partitions in giraph do not mean actual graph partitioning / graph clustering -- for instance http://dl.acm.org/citation.cfm?id=2433461. It is just a concept of splitting the vertices into different compute segments for parallel distributed processing of the graph. Anyway, there are multiple ways in which you can make the query graph available to all vertices of the big graph. For instance, you can have a different input format defined for your query graph file that will read it as one vertex with a fixed id - say QueryGraph - and any vertex that needs access to the graph can send it a message and receive the whole graph as a response. However, your requirements remind me of the Giraph++ work: http://researcher.watson.ibm.com/researcher/files/us-ytian/giraph++.pdf This is not supported in Giraph yet. I guess you want to do some computation on the whole partition, like compare the query graph with the entire partition of the big graph, etc., which is not so easy to do with the current api. I might have misunderstood; please correct me if wrong. Thanks
Date: Wed, 16 Apr 2014 09:55:18 +0530 Subject: Re: Can a vertex belong to more than one partition From: trivedi.aksh...@gmail.com To: user@giraph.apache.org
Hi, I am solving graph isomorphism between a large graph and a query graph. The large graph is partitioned, and so the query graph should be available to all partitions. Apart from this, some of the large graph's vertices (such as those which have edges between partitions) also have to be duplicated.
On Mon, Apr 7, 2014 at 9:53 PM, Pavan Kumar A pava...@outlook.com wrote:
If you want the vertex value to be available to all vertices, then you can store it in an aggregator. A vertex can belong to exactly one partition. But please answer Lukas's questions so we can answer more appropriately.
Date: Mon, 7 Apr 2014 11:23:58 +0200 From: lukas.naleze...@firma.seznam.cz To: user@giraph.apache.org Subject: Re: Can a vertex belong to more than one partition
Hi, No, a vertex can belong only to one partition. Can you describe the algorithm you are solving? How many of those vertices belonging to all partitions do you have? Why do you need such strict partitioning? Regards, Lukas
On 6.4.2014 12:38, Akshay Trivedi wrote:
In order to custom partition the graph, WorkerGraphPartitioner has to be implemented. It has a method getPartitionOwner(I vertexId) which returns the PartitionOwner of the vertex. I want some vertices to belong to all partitions, i.e., all PartitionOwners. Can anyone help me with it? Thank you in advance. Regards, Akshay
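A hedged sketch of the aggregator approach mentioned above for broadcasting a small query graph to every vertex. The aggregator name and the inline edge-list string are illustrative assumptions; in practice the query graph would be read from a side input:

import org.apache.giraph.aggregators.TextAppendAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.Text;

public class QueryGraphMaster extends DefaultMasterCompute {
  /** Assumed aggregator name; holds the serialized query graph. */
  public static final String QUERY_AGG = "query.graph";

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerPersistentAggregator(QUERY_AGG, TextAppendAggregator.class);
  }

  @Override
  public void compute() {
    if (getSuperstep() == 0) {
      // Placeholder edge list; a real job would load this from a file.
      setAggregatedValue(QUERY_AGG, new Text("q1 q2\nq2 q3\n"));
    }
  }
}

// Inside any Computation, every vertex can then read the broadcast value:
//   Text query = getAggregatedValue(QueryGraphMaster.QUERY_AGG);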
RE: Changing index of a graph
It totally depends on the input distribution. One very simple thing that can be done is: define a VertexResolver that upon every vertex creation sets its id = domain of the url and its value = set of urls in the domain; it keeps appending as more vertices with the same id (i.e., domain) are read from the input. [Now you can ignore edges altogether. All you are left with is these huge vertices that are identified by domains and contain value = set of urls.] Here you can use the aggregator approach of sending the (domain, count of set) pairs to the master - these aggregators are then combined to give something like [(domain1, offset1), (domain2, offset2), etc.]. All vertices (the huge ones) read this aggregator and figure out their offset; then, while you output, just output the vertices in the set with number = offset + number in set. So you have a map now - though it is highly unstable, because adding one more url to a domain later will change the order totally; that is when you can use id = domain + insert date, etc. [[This will stop working at some point, because the aggregator needs to carry huge messages, and then the computation of offsets via aggregators needs to be done in multiple supersteps, etc.]] Anyway, now that you have the map of url -> number, all you have to do is a join - that's simple: read your original table + this map table in a single giraph job, and you can use 2 supersteps to rename all the vertices properly.
[[Note that you can do another thing here as well, MUCH SIMPLER THAN ABOVE; see the sketch after this message.]]
Superstep 0 (workers): in your compute class, have a thread-local variable that increases for each vertex the thread computes, and assign the value [(workerId, threadId), number] to each vertex. Now aggregate {(workerId, threadId), number}.
Superstep 1 (master): now we have [{(workerId, threadId), count in group}], so compute another aggregator which is like [{(workerId, threadId), cumulative sum up to now}] and send this aggregator to the workers.
Superstep 1 (workers): read the cumulative sum from the aggregator and add it to each vertex's current value. When you output the graph this time as edge output, sourceId and targetId are set to the vertex values = the count.
Date: Tue, 15 Apr 2014 23:40:39 +0200 Subject: Re: Changing index of a graph From: mneum...@spotify.com To: user@giraph.apache.org
I have a pipeline that creates a graph and then does some transformations on it (with Giraph). In the end I want to dump it into Neo4j to allow for cypher queries. I was told that I could make the batch import for Neo4j a lot faster if I used Long identifiers without holes, thereby matching their internal ID space. If I understand it right, they use it to build an on-disk index with the IDs as offsets; that's why it should have no holes. I didn't expect it to be so costly to change the index, but I guess this way I could at least spread the load to the cluster, since batch import happens on a single machine. Thanks for the input; I will see what makes the most sense with the limited time I have.
On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec lukas.naleze...@firma.seznam.cz wrote:
Hi, I did the same thing in two M/R jobs during preprocessing - it was pretty powerful for web graphs but a little bit slow. A solution for Giraph is:
1. Implement your own partition which will iterate vertices in order. Use an appropriate partitioner.
2. During the first iteration, rename vertices in each partition without holes. Holes will be only between partitions. At the end, get the min and max vertex index for each partition, send them to the master in an aggregator, and compute the mapping required to delete the holes.
3. During the second iteration, iterate over all vertices and delete the holes by shifting vertex indexes.
4. Rename edges (two more iterations)...
Btw: Why do you need such indexes? For HLL? Lukas
On 15.4.2014 15:33, Martin Neumann wrote:
Hej, I have a huge edge list (several billion edges) where node IDs are URLs. The algorithm I want to run needs the IDs to be long, and there should be no holes in the ID space (so I can't simply hash the URLs). Is anyone aware of a simple solution that does not require an impractically huge hash map? My current idea is to load the graph into another giraph job and then assign a number to each node. This way the mapping of number to URL would be stored in the node. The problem is that I have to assign the numbers in a sequential way to ensure there are no holes and the numbers are unique. No idea if this is even possible in Giraph. Any input is welcome. Cheers, Martin
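A hedged sketch of the offset arithmetic behind the "much simpler" scheme above: each (worker, thread) group counts its vertices, the master turns the counts into cumulative starting offsets, and each vertex's final id is offset + its local number. The class and group keys are illustrative; the aggregator plumbing is left out:

import java.util.LinkedHashMap;
import java.util.Map;

public class HoleFreeRenumbering {
  /** Turn per-(worker,thread) vertex counts into starting offsets. */
  static Map<String, Long> offsets(Map<String, Long> countsByGroup) {
    Map<String, Long> offsets = new LinkedHashMap<>();
    long runningSum = 0;
    for (Map.Entry<String, Long> e : countsByGroup.entrySet()) {
      offsets.put(e.getKey(), runningSum); // this group's ids start here
      runningSum += e.getValue();
    }
    return offsets;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("w0-t0", 3L);
    counts.put("w0-t1", 2L);
    counts.put("w1-t0", 4L);
    // w0-t0 -> 0, w0-t1 -> 3, w1-t0 -> 5; the ids are dense with no holes.
    System.out.println(offsets(counts));
  }
}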
RE: Giraph Buffer Size
Are you using Java 7?
Date: Wed, 16 Apr 2014 13:07:20 +0530 Subject: Re: Giraph Buffer Size From: agrta.ra...@gmail.com To: user@giraph.apache.org
Hi Pavan, For all the intermediate processing there would be a buffer (intermediate memory space) that stores data, messages, etc., which the process then consumes. Please correct me if I am wrong. I have set the Xms and Xmx values properly. The problem is that the task runs for small datasets, but as the input data size is increased, it fails. The error that I am getting in the sysout logs is:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x2b404144, pid=10397, tid=1144650048
#
# JRE version: 6.0_25-b06
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J sun.nio.ch.SelectorImpl.processDeregisterQueue()V
#
# An error report file with more information is saved as:
# /hadoopTaskTrackerLogsLocation/process_id/s_err_pid10397.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
Please suggest what should be done. Am I missing anything? Regards, Agrta Rawat
On Wed, Apr 16, 2014 at 12:44 PM, Pavan Kumar A pava...@outlook.com wrote: ...
RE: Optimal number of Workers
Giraph uses threads for compute, the netty server, netty clients on workers, execution pools, input, output, etc. You can see most of these options in org.apache.giraph.conf.GiraphConstants, for instance:
/** Netty client threads */
IntConfOption NETTY_CLIENT_THREADS =
    new IntConfOption("giraph.nettyClientThreads", 4, "Netty client threads");
/** Netty server threads */
IntConfOption NETTY_SERVER_THREADS =
    new IntConfOption("giraph.nettyServerThreads", 16, "Netty server threads");
/** Number of threads for vertex computation */
IntConfOption NUM_COMPUTE_THREADS =
    new IntConfOption("giraph.numComputeThreads", 1, "Number of threads for vertex computation");
/** Number of threads for input split loading */
IntConfOption NUM_INPUT_THREADS =
    new IntConfOption("giraph.numInputThreads", 1, "Number of threads for input split loading");
The idea is that if you run your job in a cluster of 5 machines, typically 1 machine is the master and 4 of them are workers which load the graph and compute on it. Each worker is a separate machine, and to maximize its utilization we can use as many threads as it can handle. However, if you are running in pseudo-distributed mode, then all workers run on the same machine and each still tries to launch the number of threads set by default in the config - though each worker is now a thread (instead of a machine), it still launches all these other threads unscrupulously. Anyway, you can configure these threads spawned by workers to reduce the overall number of threads launched on your one machine (see the example below).
From: chadijaber...@hotmail.com To: user@giraph.apache.org Subject: Optimal number of Workers Date: Tue, 15 Apr 2014 13:34:53 +0200
Hello! Can anybody explain how threads are used by a worker in Giraph? For which purposes? How is the number of threads to use determined by the worker? I often have the following error: org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread. A check on the number of threads per worker gives child processes with 100 threads per worker process (10 workers on a 12-processor machine), which is in my opinion too large, isn't it? If I reduce the number of workers, the number of threads decreases. How must we choose the number of workers? Thanks in advance. Chadi
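As a concrete illustration of the reply above, dialing these options down for a single-machine (pseudo-distributed) setup might look like the following; the values are placeholders chosen to reduce the total thread count, not recommended settings:

-Dgiraph.nettyClientThreads=2
-Dgiraph.nettyServerThreads=4
-Dgiraph.numComputeThreads=1
-Dgiraph.numInputThreads=1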
RE: clustering coefficient (counting triangles) in giraph.
If what you need is http://en.wikipedia.org/wiki/Clustering_coefficient#Local_clustering_coefficient then I implemented it in Giraph and will submit a patch soon.
Date: Mon, 17 Mar 2014 15:33:07 -0400 Subject: Re: clustering coefficient (counting triangles) in giraph. From: kaushikpatn...@gmail.com To: user@giraph.apache.org
Check out this paper on implementing triangle counting in a BSP model by Prof. David Bader from Georgia Tech: http://www.cc.gatech.edu/~bader/papers/GraphBSPonXMT-MTAAP2013.pdf I implemented a similar version in Apache Giraph, and it worked pretty well. You have to switch on the write-to-disk option, though, as in the second and third cycles of the algorithm you have a massive message build-up.
On Mon, Mar 17, 2014 at 3:17 PM, Suijian Zhou suijian.z...@gmail.com wrote:
Hi, Experts, Does anybody know if there are examples of implementations in giraph of the clustering coefficient (counting triangles)? Thanks! Best Regards, Suijian
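A hedged sketch of one standard two-superstep approach to the local clustering coefficient; this is not the patch mentioned above. It assumes an undirected graph stored with both edge directions, and uses space-joined Text messages for simplicity:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class LocalClusteringCoefficient
    extends BasicComputation<LongWritable, DoubleWritable, NullWritable, Text> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<Text> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Send this vertex's neighbor list to every neighbor.
      StringBuilder ids = new StringBuilder();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        ids.append(edge.getTargetVertexId().get()).append(' ');
      }
      sendMessageToAllEdges(vertex, new Text(ids.toString().trim()));
    } else {
      // Count received neighbor ids that are also our own neighbors; each
      // edge among neighbors is counted twice, matching the d*(d-1) denominator.
      Set<Long> neighbors = new HashSet<>();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        neighbors.add(edge.getTargetVertexId().get());
      }
      long links = 0;
      for (Text message : messages) {
        for (String token : message.toString().split(" ")) {
          if (!token.isEmpty() && neighbors.contains(Long.parseLong(token))) {
            links++;
          }
        }
      }
      long d = neighbors.size();
      vertex.setValue(new DoubleWritable(d < 2 ? 0.0 : (double) links / (d * (d - 1))));
      vertex.voteToHalt();
    }
  }
}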
RE: Running one compute function after another..
Jyoti - I recently did a similar thing. In fact, my approach was exactly what Maja suggested. However, there is a caveat. You can switch the computation class for workers in the mastercompute's compute method, but that requires the messages sent by the computation class active before switching and the messages received by the computation class after switching to be the same. For instance:
Superstep 1 - Compute-A (M1)
Superstep 2 - Compute-A (M1)
Superstep 3 - Compute-B (receives M1, outgoing is M2) -- you can achieve this using AbstractComputation instead of BasicComputation.
However, if Compute-B needs to be used in superstep 4 as well, i.e.:
Superstep 4 - Compute-B [it receives M2, but that conflicts with its definition]
So in this case the trick is:
Superstep 1 - Compute-A (M1)
Superstep 2 - Compute-A (M1)
(time to switch)
Superstep 3 - NoOpMessageSink extends AbstractComputation<I,V,E,M1,M2>, whose compute() = { translate M1 -> M2 } (make the switch)
Superstep 4 - Compute-B (M2)
Superstep 5 - Compute-B (M2), and so on.
If your compute functions alternate, then you can extend AbstractComputation like:
Superstep 1 - Compute-A (extends AbstractComputation<I,V,E,M1,M2>)
Superstep 2 - Compute-B (extends AbstractComputation<I,V,E,M2,M1>)
Superstep 3 - Compute-A (extends AbstractComputation<I,V,E,M1,M2>)
Superstep 4 - Compute-B (extends AbstractComputation<I,V,E,M2,M1>)
@Maja, please add to / correct what I wrote. Thanks.
From: majakabi...@fb.com To: user@giraph.apache.org Subject: Re: Running one compute function after another.. Date: Sat, 11 Jan 2014 19:01:08 +
Hi Jyoti, A cleaner way to do this is to switch the Computation class which is used at the moment your condition is satisfied. So you can have an aggregator to check whether the condition is met, and then in your MasterCompute you call setComputation(SecondComputationClass.class) when needed. Regards, Maja
From: Jyoti Yadav rao.jyoti26ya...@gmail.com Reply-To: user@giraph.apache.org Date: Saturday, January 11, 2014 10:48 AM To: user@giraph.apache.org Subject: Re: Running one compute function after another..
Hi Ilias... I will go with this.. Thanks...
On Sat, Jan 11, 2014 at 10:52 PM, Ilias ikapo...@csd.auth.gr wrote:
Hey, You can have a boolean variable initially set to true (or false, whatever). Then you divide your code based on the value of that variable with an if-else statement. For my example, if the value is true then it goes through the first 'if'. When the condition you want is fulfilled, change the value of the variable to false (at all nodes) and then the second part will be executed. Ilias
On 11/1/2014 6:18, Jyoti Yadav wrote:
Hi folks.. In my algorithm, all vertices execute one compute function up to a certain condition; when that condition is fulfilled, I want all vertices to then execute another compute function. Is it possible?? Any ideas are highly appreciated.. Thanks, Jyoti
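A hedged sketch of the aggregator-driven switch Maja and Pavan describe. The class names, aggregator name, and phase logic are illustrative assumptions; note that both phases deliberately use the same message type (LongWritable) so the sent/received types line up across the switch:

import java.io.IOException;
import org.apache.giraph.aggregators.BooleanOrAggregator;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SwitchingMaster extends DefaultMasterCompute {
  /** Assumed aggregator name; workers OR in 'true' when phase one is done. */
  public static final String PHASE_DONE_AGG = "phase.one.done";

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator(PHASE_DONE_AGG, BooleanOrAggregator.class);
  }

  @Override
  public void compute() {
    if (getSuperstep() == 0) {
      setComputation(PhaseOne.class);
    } else {
      BooleanWritable done = getAggregatedValue(PHASE_DONE_AGG);
      if (done.get()) {
        setComputation(PhaseTwo.class); // condition met: switch for the next superstep
      }
    }
  }
}

class PhaseOne extends BasicComputation<LongWritable, NullWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, NullWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    // ... phase-one logic; signal the master once the condition is met.
    aggregate(SwitchingMaster.PHASE_DONE_AGG, new BooleanWritable(true));
  }
}

class PhaseTwo extends BasicComputation<LongWritable, NullWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, NullWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    vertex.voteToHalt(); // ... phase-two logic.
  }
}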
RE: Issues with Giraph v1.0.0
Hi Pankaj, Note that in Giraph the vertex is the first-class citizen, while edges are just data associated with a vertex. So when you delete a vertex you delete all data associated with it, i.e., its outgoing edges, its value, its id, etc. However, it is not trivial to delete all incoming edges of a vertex, since it is not directly aware of the existence of such edges. Such edges can only be deleted at the source vertex. Note that based on the type of out-edges you use, Giraph can support a multi-graph or a simple graph [both being directed, of course]. So if you want to avoid duplicated edges you can set the OutEdges type to LongDoubleHashMapEdges, for example. Issues 2 and 3 are related. You can overcome all of these at the application layer by dedicating the first few supersteps to cleaning up your graph, though.
From: pankajiit...@gmail.com Date: Fri, 13 Dec 2013 13:25:49 +0530 Subject: Issues with Giraph v1.0.0 To: user@giraph.apache.org CC: agrta.ra...@gmail.com
Hi, I am facing the following issues while using giraph-1.0.0-for-hadoop-1.0.2:
1. Deletion of a vertex does not delete the incoming edges of that vertex.
2. When the removeVertexRequest(sourceId, targetId) method is used inside a for-loop, i.e., multiple delete requests are sent by a vertex, it deletes only the last edge identified by the for-loop.
3. Edge creation does not check for duplicate edges.
Any pointers or solutions will be helpful. Thanks in advance! Regards, Pankaj
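A hedged sketch of the application-layer cleanup suggested above: a few supersteps in which in-neighbors are discovered and then asked to drop their edges to doomed vertices. The doomed-vertex predicate and all types are illustrative assumptions:

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class IncomingEdgeCleanup
    extends BasicComputation<LongWritable, NullWritable, NullWritable, LongWritable> {

  private boolean doomed(LongWritable id) {
    return id.get() < 0; // placeholder predicate for vertices to delete
  }

  @Override
  public void compute(Vertex<LongWritable, NullWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Announce "I have an edge to you" so targets learn their in-neighbors.
      sendMessageToAllEdges(vertex, vertex.getId());
    } else if (getSuperstep() == 1) {
      if (doomed(vertex.getId())) {
        for (LongWritable source : messages) {
          // Copy the id: message objects are reused by the iterator.
          sendMessage(new LongWritable(source.get()), vertex.getId());
        }
        removeVertexRequest(vertex.getId()); // the doomed vertex deletes itself
      }
    } else {
      for (LongWritable target : messages) {
        vertex.removeEdges(new LongWritable(target.get())); // drop edges into doomed vertices
      }
      vertex.voteToHalt();
    }
  }
}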
RE: vertex and data block co-location
@David: You can have a look at http://researcher.watson.ibm.com/researcher/files/us-ytian/giraph++.pdf This work was done by http://researcher.watson.ibm.com/researcher/view.php?person=us-ytian In it she talks about alternative partitioning schemes she implemented on top of giraph and then shows the resulting optimizations, taking some graph algorithms as examples.
Date: Sun, 8 Dec 2013 10:55:51 -0800 Subject: Re: vertex and data block co-location From: apache.mail...@gmail.com To: user@giraph.apache.org
Running Giraph on MapReduce, you have no control over where the worker tasks will be hosted on the cluster. Therefore the partitioning generally is not aware of co-located blocks and does a fair amount of time-consuming network shuffling of data during the initialization of a Giraph job. What Giraph does do is, as each worker task spins up on the cluster, it attempts to claim input splits that happen to be local to the DataNode the worker runs on. This speeds up the initial ingestion of graph data quite a bit, but does not help much when it comes to distributing the data to the worker that owns that data's assigned partition. Only when all data have been pushed to the appropriate worker can the Giraph job actually begin. When data actually does end up belonging to a host-local partition it is not sent over the network, but in many cases there is no alternative without using an alternative to hash partitioning.
On Sat, Nov 16, 2013 at 12:22 PM, David J Garcia djch...@utexas.edu wrote:
hello, I was wondering if there was a way to ensure that vertices located on the same data block (on hdfs) are co-located with each other? Also, will the vertices in input splits (splits that are located on the same DataNode) have a reasonable chance of being partitioned to the same id? For example, suppose that I have vertex_1 located on data_block_i and vertex_2 located on data_block_k. Let's suppose that both of the data blocks are located on the same DataNode machine. Is there a reasonably good chance that vertex_1 and vertex_2 will partition to the same id? I'm doing a research project and I'm trying to show the benefits of graph data locality. -David