Please welcome our newest committer, Igor Kabiljo!
I am pleased to announce that Igor Kabiljo has been invited to become a committer by the Project Management Committee (PMC) of Apache Giraph, and he has accepted.

Igor's most important contributions are the reduce/broadcast API that generalizes aggregators, the primitive message/edge storages that make applications more efficient, and work on partitioner factories that take advantage of good graph partitioning. He has also been creating issues suitable for beginners and guiding them along the way.

Igor, we are looking forward to your future work and deeper involvement in the project.

Thanks,
Maja

List of Igor's contributions:
GIRAPH-785: Improve GraphPartitionerFactory usage
GIRAPH-786: XSparseVector create a lot of objects in add/write
GIRAPH-848: Allowing plain computation with types being configurable
GIRAPH-934: Allow having state in aggregators
GIRAPH-935: Loosen modifiers when needed
GIRAPH-938: Allow fast working with primitives generically
GIRAPH-939: Reduce/broadcast API
GIRAPH-954: Allow configurable Aggregators/Reducers again
GIRAPH-955: Allow vertex/edge/message value to be configurable
GIRAPH-961: Internals of MasterLoggingAggregator have been incorrectly removed
GIRAPH-965: Improving and adding reducers
GIRAPH-986: Add more stuff to TypeOps
GIRAPH-987: Improve naming for ReduceOperation

Beginner issues he guided:
GIRAPH-891: Make MessageStoreFactory configurable
GIRAPH-895: Trim the edges in Giraph
GIRAPH-921: Create ByteValueVertex to store vertex values as bytes without object instance
GIRAPH-988: Allow object to be specified as next Computation in Giraph
Please welcome our newest committer, Sergey Edunov!
I am happy to announce that the Project Management Committee (PMC) for Apache Giraph has elected Sergey Edunov to become a committer, and he has accepted. Sergey has been an active member of the Giraph community, finding issues, submitting patches and reviewing code. We're looking forward to Sergey's larger involvement and future work.

List of his contributions:
GIRAPH-895: Trim the edges in Giraph
GIRAPH-896: Memory leak in SuperstepMetricsRegistry
GIRAPH-897: Add an option to dump only live objects to JMap
GIRAPH-898: Remove giraph-accumulo from Facebook profile
GIRAPH-903: Detect crashes on Netty threads
GIRAPH-924: Fix checkpointing
GIRAPH-925: Unit tests should pass even if zookeeper port not available
GIRAPH-927: Decouple netty server threads from message processing
GIRAPH-933: Checkpointing improvements
GIRAPH-936: Decouple netty server threads from message processing
GIRAPH-940: Cleanup the list of supported hadoop versions
GIRAPH-950: Auto-restart from checkpoint doesn't pick up latest checkpoint
GIRAPH-963: Aggregators may fail with IllegalArgumentException upon deserialization

Best,
Maja
Re: [RESULT] [VOTE] Apache Giraph 1.1.0 RC2
Thank you for your work on the release, Roman!

On 11/18/14, 10:55 AM, Avery Ching ach...@apache.org wrote:
Thanks for pushing this through, Roman. Looks great!

On 11/18/14, 4:30 AM, Roman Shaposhnik wrote:
Hi!

With 3 binding +1s, one non-binding +1, and no 0s or -1s, the vote to publish Apache Giraph 1.1.0 RC2 as the 1.1.0 release of Apache Giraph passes. Thanks to everybody who spent time on validating the bits!

The vote tally is:
+1s:
Claudio Martella (binding)
Maja Kabiljo (binding)
Eli Reisman (binding)
Roman Shaposhnik (non-binding)

I'll do the publishing tonight and will send an announcement!

Thanks,
Roman (AKA 1.1.0 RM)

On Thu, Nov 13, 2014 at 5:28 AM, Roman Shaposhnik ro...@shaposhnik.org wrote:
This vote is for Apache Giraph, version 1.1.0 release.
It fixes the following issues: http://s.apache.org/a8X

*** Please download, test and vote by Mon 11/17 noon PST

Note that we are voting upon the source (tag): release-1.1.0-RC2
Source and binary files are available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/
Staged website is available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/site/
Maven staging repo is available at: https://repository.apache.org/content/repositories/orgapachegiraph-1003

Please notice that, as per earlier agreement, two sets of artifacts are published, differentiated by the version ID:
* version ID 1.1.0 corresponds to the artifacts built for the hadoop_1 profile
* version ID 1.1.0-hadoop2 corresponds to the artifacts built for the hadoop_2 profile

The tag to be voted upon (release-1.1.0-RC2):
https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=log;h=refs/tags/release-1.1.0-RC2

The KEYS file containing PGP keys we use to sign the release:
http://svn.apache.org/repos/asf/bigtop/dist/KEYS

Thanks, Roman.
Re: [VOTE] Apache Giraph 1.1.0 RC2
+1, thanks Roman!

From: Claudio Martella claudio.marte...@gmail.com
Reply-To: user@giraph.apache.org
Date: Thursday, November 13, 2014 at 5:53 AM
To: user@giraph.apache.org
Cc: d...@giraph.apache.org
Subject: Re: [VOTE] Apache Giraph 1.1.0 RC2

+1.

On Thu, Nov 13, 2014 at 2:28 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:
This vote is for Apache Giraph, version 1.1.0 release.
It fixes the following issues: http://s.apache.org/a8X

*** Please download, test and vote by Mon 11/17 noon PST

Note that we are voting upon the source (tag): release-1.1.0-RC2
Source and binary files are available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/
Staged website is available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/site/
Maven staging repo is available at: https://repository.apache.org/content/repositories/orgapachegiraph-1003

Please notice that, as per earlier agreement, two sets of artifacts are published, differentiated by the version ID:
* version ID 1.1.0 corresponds to the artifacts built for the hadoop_1 profile
* version ID 1.1.0-hadoop2 corresponds to the artifacts built for the hadoop_2 profile

The tag to be voted upon (release-1.1.0-RC2):
https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=log;h=refs/tags/release-1.1.0-RC2

The KEYS file containing PGP keys we use to sign the release:
http://svn.apache.org/repos/asf/bigtop/dist/KEYS

Thanks, Roman.
--
Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
We've been running code which is the same as the release candidate plus the fix from GIRAPH-961 in production for 5 days now, no problems. This is the hadoop_facebook profile, using only hive-io out of all the io modules.

On 11/1/14, 3:49 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:
Ping! Any progress on testing the current RC?

Thanks,
Roman.

On Fri, Oct 31, 2014 at 9:00 AM, Claudio Martella claudio.marte...@gmail.com wrote:
Oh, thanks for the info!

On Fri, Oct 31, 2014 at 3:06 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:
On Fri, Oct 31, 2014 at 3:26 AM, Claudio Martella claudio.marte...@gmail.com wrote:
Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation.

None of them are missing. The links moved to User Docs - Modules though:
http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/gora.html
http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/rexster.html
and so forth.

Thanks,
Roman.
--
Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
Roman, again thanks for taking care of the release. We found one issue, https://issues.apache.org/jira/browse/GIRAPH-961 - any application using MasterLoggingAggregator fails without this fix. Can we backport it to the release?

Thanks,
Maja

On 10/26/14, 12:25 AM, Roman Shaposhnik ro...@shaposhnik.org wrote:
This vote is for Apache Giraph, version 1.1.0 release.
It fixes the following issues: http://s.apache.org/a8X

*** Please download, test and vote by Mon 11/3 noon PST

Note that we are voting upon the source (tag): release-1.1.0-RC1
Source and binary files are available at: http://people.apache.org/~rvs/giraph-1.1.0-RC1/
Staged website is available at: http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/
Maven staging repo is available at: https://repository.apache.org/content/repositories/orgapachegiraph-1002

Please notice that, as per earlier agreement, two sets of artifacts are published, differentiated by the version ID:
* version ID 1.1.0 corresponds to the artifacts built for the hadoop_1 profile
* version ID 1.1.0-hadoop2 corresponds to the artifacts built for the hadoop_2 profile

The tag to be voted upon (release-1.1.0-RC1):
https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=commit;h=1f0fc23c26ce3addb746e3e57cc155f82afbab87

The KEYS file containing PGP keys we use to sign the release:
http://svn.apache.org/repos/asf/bigtop/dist/KEYS

Thanks, Roman.
Re: Running one compute function after another..
Hi Jyoti,

A cleaner way to do this is to switch the Computation class which is used at the moment your condition is satisfied. So you can have an aggregator to check whether the condition is met, and then in your MasterCompute you call setComputation(SecondComputationClass.class) when needed.

Regards,
Maja

From: Jyoti Yadav rao.jyoti26ya...@gmail.com
Reply-To: user@giraph.apache.org
Date: Saturday, January 11, 2014 10:48 AM
To: user@giraph.apache.org
Subject: Re: Running one compute function after another..

Hi Ilias...
I will go by this.. Thanks...

On Sat, Jan 11, 2014 at 10:52 PM, Ilias ikapo...@csd.auth.gr wrote:
Hey,

You can have a boolean variable initially set to true (or false, whatever). Then you divide your code based on the value of that variable with an if-else statement. For my example, if the value is true then it goes through the first 'if'. When the condition you want is fulfilled, change the value of the variable to false (at all nodes) and then the second part will be executed.

Ilias

On 11/1/2014 6:18 PM, Jyoti Yadav wrote:
Hi folks..
In my algorithm, all vertices execute one compute function up to a certain condition. When that condition is fulfilled, I want all vertices to then execute another compute function. Is it possible?
Any ideas are highly appreciated..

Thanks
Jyoti
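The phase switch Maja describes can be sketched in plain Java, without the Giraph API on the classpath. This is a simplified model, not Giraph code: the two lambdas stand in for two Computation classes, the running max stands in for an aggregator, and the threshold logic is invented for illustration.

```java
// Simplified model of switching computations from the master: after every
// "superstep" the master checks an aggregated condition and, once it holds,
// switches the computation that all vertices will run next (in real Giraph:
// MasterCompute.setComputation(SecondComputationClass.class)).
import java.util.function.LongUnaryOperator;

public class PhaseSwitchModel {
    // Phase 1: double each value; Phase 2: decrement each value.
    static final LongUnaryOperator FIRST = v -> v * 2;
    static final LongUnaryOperator SECOND = v -> v - 1;

    /** Runs supersteps, switching FIRST -> SECOND once max value >= threshold. */
    static long[] run(long[] values, long threshold, int supersteps) {
        LongUnaryOperator computation = FIRST;     // like setComputation(First.class)
        for (int step = 0; step < supersteps; step++) {
            long max = Long.MIN_VALUE;             // plays the role of a max aggregator
            for (int i = 0; i < values.length; i++) {
                values[i] = computation.applyAsLong(values[i]);
                max = Math.max(max, values[i]);
            }
            if (max >= threshold) {
                computation = SECOND;              // the "master" switches the phase
            }
        }
        return values;
    }

    public static void main(String[] args) {
        long[] out = run(new long[]{1, 2}, 4, 3);
        // step 0 doubles to {2, 4}, max 4 triggers the switch;
        // steps 1 and 2 decrement to {0, 2}
        System.out.println(out[0] + "," + out[1]);
    }
}
```

In real Giraph the condition check and setComputation call live in MasterCompute.compute(), which runs between supersteps, so every worker sees the new Computation class at the same superstep boundary.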
Re: About writing our own aggregator..
Hi Jyoti,

You can take a look inside the org.apache.giraph.aggregators package; there are many implementations already there. Some are simple, like LongSumAggregator, and some more complex ones are inside the matrix package. Please look through those and let me know if you need additional help. When you manage to implement this, you can also contribute it back to Giraph!

Maja

From: Jyoti Yadav rao.jyoti26ya...@gmail.com
Reply-To: user@giraph.apache.org
Date: Thursday, January 9, 2014 12:23 AM
To: user@giraph.apache.org
Subject: About writing our own aggregator..

Hi Folks...

I am trying to implement one graph algorithm on Giraph. I want all vertices to send their ids to the master. For that I need to implement my own aggregator class.
Please suggest how to proceed...

Thanks
Jyoti
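For the id-collecting use case in the question, the aggregator contract boils down to an aggregate() method every vertex can call plus a combined value the master reads in the next superstep. Below is a plain-Java model of that contract (loosely modeled on Giraph's TextAppendAggregator); it is a sketch, not the actual Giraph Aggregator interface.

```java
// Minimal model of an aggregator that collects vertex ids for the master.
// In real Giraph, aggregation happens across workers and the combine order
// is NOT deterministic, so code must not rely on the order of appended ids.
import java.util.StringJoiner;

public class IdListAggregatorModel {
    private final StringJoiner ids = new StringJoiner(",");

    /** Called by each vertex during a superstep. */
    public void aggregate(long vertexId) {
        ids.add(Long.toString(vertexId));
    }

    /** Read by the master (or vertices) in the following superstep. */
    public String getAggregatedValue() {
        return ids.toString();
    }

    public static void main(String[] args) {
        IdListAggregatorModel agg = new IdListAggregatorModel();
        for (long vertexId : new long[]{3, 5, 7}) {
            agg.aggregate(vertexId);
        }
        System.out.println(agg.getAggregatedValue());
    }
}
```

In a real job the aggregator is registered in MasterCompute.initialize() and vertices call aggregate() from compute(); beware that collecting every vertex id into one value does not scale to large graphs.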
Re: Problem with Giraph (please help me)
Hi Chadi,

That does seem like a serialization issue. Which OutEdges class are you using, is it something you implemented?

Regards,
Maja

From: chadi jaber chadijaber...@hotmail.com
Reply-To: user@giraph.apache.org
Date: Thursday, January 9, 2014 2:08 AM
To: Lukas Nalezenec lukas.naleze...@firma.seznam.cz, user@giraph.apache.org
Subject: RE: Problem with Giraph (please help me)

Hello Lukas,
I have enclosed the exception in my previous emails. It seems to be a serialization issue (this occurs only when workers > 1):
...
2013-12-31 16:27:33,494 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 4 connections, (4 total connected) 0 failed, 0 failures total.
2013-12-31 16:27:33,501 INFO org.apache.giraph.worker.BspServiceWorker: loadInputSplits: Using 1 thread(s), originally 1 threads(s) for 1 total splits.
2013-12-31 16:27:33,508 INFO org.apache.giraph.comm.SendPartitionCache: SendPartitionCache: maxVerticesPerTransfer = 1
2013-12-31 16:27:33,508 INFO org.apache.giraph.comm.SendPartitionCache: SendPartitionCache: maxEdgesPerTransfer = 8
2013-12-31 16:27:33,524 INFO org.apache.giraph.worker.InputSplitsCallable: call: Loaded 0 input splits in 0.020270009 secs, (v=0, e=0) 0.0 vertices/sec, 0.0 edges/sec
2013-12-31 16:27:33,527 INFO org.apache.giraph.comm.netty.NettyClient: waitAllRequests: Finished all requests. MBytes/sec sent = 0, MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req MBytes = 0, ave received req MBytes = 0, secs waited = 0.656
2013-12-31 16:27:33,527 INFO org.apache.giraph.worker.BspServiceWorker: setup: Finally loaded a total of (v=0, e=0)
2013-12-31 16:27:33,598 INFO org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window metrics MBytes/sec sent = 0, MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req MBytes = 0, ave received req MBytes = 0, secs waited = 0.816
2013-12-31 16:27:33,605 WARN org.apache.giraph.comm.netty.handler.RequestServerHandler: exceptionCaught: Channel failed with remote address /172.16.45.53:59257
java.io.EOFException
at org.jboss.netty.buffer.ChannelBufferInputStream.checkAvailable(ChannelBufferInputStream.java:231)
at org.jboss.netty.buffer.ChannelBufferInputStream.readInt(ChannelBufferInputStream.java:174)
at org.apache.giraph.edge.ByteArrayEdges.readFields(ByteArrayEdges.java:172)
at org.apache.giraph.utils.WritableUtils.reinitializeVertexFromDataInput(WritableUtils.java:480)
at org.apache.giraph.utils.WritableUtils.readVertexFromDataInput(WritableUtils.java:511)
at org.apache.giraph.partition.SimplePartition.readFields(SimplePartition.java:126)
at org.apache.giraph.comm.requests.SendVertexRequest.readFieldsRequest(SendVertexRequest.java:66)
at org.apache.giraph.comm.requests.WritableRequest.readFields(WritableRequest.java:120)
at org.apache.giraph.comm.netty.handler.RequestDecoder.decode(RequestDecoder.java:92)
at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:72)
at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:69)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

The code for my vertex compute function:
public class MergeVertex
extends Vertex<LongWritable, DoubleWritable, DoubleWritable, NodeMessage> {
...
/***
 * Convert a Vertex Id from its LongWritable format to Point format (2 Element Array Format)
 * @param lng LongWritable Format of the VertexId
 * @return Alignment point Array
 */
public static int[] cvtLongToPoint(LongWritable lng) {
  int[] point = {0, 0};
  point[0] = (int) (lng.get() / 1000);
  point[1] = (int) (lng.get() % 1000);
  return point;
}

@Override
public void compute(Iterable<NodeMessage> messages) throws IOException {
  int[] currentId = cvtLongToPoint(getId());
  if (getSuperstep() == 0) {
    //NodeValue nv=new NodeValue();
    setValue(new DoubleWritable(0d));
  }
  _signallength = getContext().getConfiguration().getInt("SignalLength", 0);
  if ((getSuperstep() < _signallength && getId().get() != 0L)
      || (getSuperstep() == 0 && getId().get() == 0L)) {
    LongWritable dstId = new LongWritable();
    // Nodes which are on Graph Spine
    // Remaining Edges Construction
    if (currentId[0] == currentId[1]) {
      // Right side
      for (int i = currentId[1] + 1; i < _signallength; i++) {
        dstId = cvtPointToLong(currentId[0] + 1, i);
        addVertexRequest(dstId, new DoubleWritable(Double.MAX_VALUE));
        addEdgeRequest(getId(), EdgeFactory.create(dstId,
            new DoubleWritable(computeCost(getId(), dstId))));
      }
      // Left side
      for (int i = currentId[0] + 2; i < _signallength; i++) {
        dstId = cvtPointToLong(i, currentId[1] + 1);
        addVertexRequest(dstId, new
Re: A Vertex Holds Other Than Text
Hi Agrta,

Take a look at IntIntTextVertexValueInputFormat for example, where vertex values are ints. If your vertex values are complex objects, you need to create a class which implements the Writable interface and holds all your data, and then extend the input format to read all the data you have.

Hope this helps; if not, please give us some more details about what you are trying to do.

Regards,
Maja

From: Agrta Rawat agrta.ra...@gmail.com
Reply-To: user@giraph.apache.org
Date: Wednesday, January 1, 2014 3:53 AM
To: user@giraph.apache.org
Subject: A Vertex Holds Other Than Text

Hi,

I am implementing an algorithm in which a vertex needs to hold its values in a class other than Text (as the value of a vertex is a record). I am trying to make use of VertexValueInputFormat but can't reach a solution. My Giraph version is 1.0.0.
Kindly help me in resolving this issue.

Regards,
Agrta Rawat
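The "class which implements Writable" that Maja mentions boils down to two methods over java.io.DataInput/DataOutput, so the pattern can be shown with the JDK alone. Below is a sketch of a record-style vertex value; the field names are made up for illustration, and in a real job the class would additionally declare "implements Writable" (from Hadoop) and be wired into the input format.

```java
// A complex vertex value serialized field by field, in a fixed order; the
// read side must mirror the write side exactly. main() round-trips the value
// through bytes, which is what Giraph does when shipping vertex values.
import java.io.*;

public class RecordValue {
    private long userId;
    private double score;
    private String label;

    public RecordValue() {}                // Writable requires a no-arg constructor
    public RecordValue(long userId, double score, String label) {
        this.userId = userId; this.score = score; this.label = label;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeDouble(score);
        out.writeUTF(label);
    }

    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        score = in.readDouble();
        label = in.readUTF();
    }

    public String label() { return label; }

    /** Serialize to bytes and deserialize into a fresh instance. */
    static RecordValue roundTrip(RecordValue v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            v.write(new DataOutputStream(bytes));
            RecordValue copy = new RecordValue();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new RecordValue(42L, 0.5, "news")).label());
    }
}
```

Getting write() and readFields() out of sync is exactly the kind of bug that surfaces as the EOFException discussed elsewhere in this list.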
Re: Extending AbstractComputation
Hi Pushparaj and Peter,

There is going to be one Computation instance per partition in each of the supersteps. Each partition is processed by a single thread, so accessing any data inside your Computation is thread-safe. Multiple threads are going to be executing computation on multiple partitions, and therefore do not interfere with each other. The only place where you have to worry about synchronization is if you are using pre/postSuperstep and accessing some global data from WorkerContext.

Regards,
Maja

From: Peter Grman peter.gr...@gmail.com
Reply-To: user@giraph.apache.org
Date: Monday, December 23, 2013 3:03 PM
To: user@giraph.apache.org
Subject: Re: Extending AbstractComputation

I don't know the exact logic, maybe somebody who does could elaborate on that, but I noticed that it was used multiple times for different nodes. I would think that it is used as a pool to minimize the number of objects created, am I right here? The question I would add: can the compute function be called concurrently on multiple objects, or is it really a pool where the calls to the function don't interfere with each other?

Thanks,
Peter

On Mon, Dec 23, 2013 at 8:56 PM, Pushparaj Motamari pushpara...@gmail.com wrote:
Hi,

Is the class we write extending AbstractComputation instantiated once per worker?

Thanks
Pushparaj
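Maja's threading rule can be illustrated without any Giraph types: state that lives with one partition is touched by exactly one thread and needs no locking, while state shared across partitions (the WorkerContext case) must be synchronized. The sketch below models partitions as arrays and the shared worker-level counter as an AtomicLong; all names are invented for illustration.

```java
// One thread per partition: the local accumulator is per-thread state (safe
// without locks, like fields on a Computation instance), while the shared
// counter models WorkerContext-style global data and must be atomic.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PartitionThreadingModel {
    static long countEdges(long[][] partitions) {
        AtomicLong shared = new AtomicLong();            // "WorkerContext" state
        ExecutorService pool = Executors.newFixedThreadPool(partitions.length);
        for (long[] partition : partitions) {
            pool.submit(() -> {
                long local = 0;                          // per-partition state: no locking
                for (long edges : partition) {
                    local += edges;
                }
                shared.addAndGet(local);                 // shared state: needs atomicity
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return shared.get();
    }

    public static void main(String[] args) {
        System.out.println(countEdges(new long[][]{{1, 2, 3}, {4, 5}}));
    }
}
```

Replacing the AtomicLong with a plain long and unsynchronized `shared += local` is the class of bug Maja warns about in pre/postSuperstep.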
Re: MultiVertexInputFormat
Hi Yasser,

You can do this through the Configuration parameters. You should call:
description1.addParameter("myApplication.vertexInputPath", "file1.txt");
and
description2.addParameter("myApplication.vertexInputPath", "file2.txt");
Then from the code of your InputFormat class you can get this parameter from the Configuration. If it doesn't already, make sure your InputFormat implements ImmutableClassesGiraphConfigurable, and the configuration is going to be set in it automatically. You can also take a look at HiveGiraphRunner, which uses multiple inputs and sets parameters the user passes from the command line.

Hope this helps,
Maja

From: Yasser Altowim yasser.alto...@ericsson.com
Reply-To: user@giraph.apache.org
Date: Monday, August 19, 2013 9:16 AM
To: user@giraph.apache.org
Subject: RE: MultiVertexInputFormat

Hi Guys,

Any help on this will be appreciated. I am repeating my question and my code below:

I am implementing an algorithm in Giraph that reads the vertex values from two input files, each of which has its own format. I am not using any EdgeInputFormatClass. I am now using VertexInputFormatDescription along with MultiVertexInputFormat, but still could not figure out how to set the vertex input path for each input format class. Can you please take a look at my code below and show me how to set the vertex input path? I have taken a look at HiveGiraphRunner but still no luck.

Thanks.

if (null == getConf()) {
  conf = new Configuration();
}
GiraphConfiguration gconf = new GiraphConfiguration(getConf());
int workers = Integer.parseInt(arg0[2]);
gconf.setWorkerConfiguration(workers, workers, 100.0f);

List<VertexInputFormatDescription> vertexInputDescriptions = Lists.newArrayList();

// Input one
VertexInputFormatDescription description1 =
    new VertexInputFormatDescription(UseCase1FirstVertexInputFormat.class);
// how to set the vertex input path? i.e. how to say that I want to read
// file1.txt using this input format class
vertexInputDescriptions.add(description1);

// Input two
VertexInputFormatDescription description2 =
    new VertexInputFormatDescription(UseCase1SecondVertexInputFormat.class);
// how to set the vertex input path?
vertexInputDescriptions.add(description2);

GiraphConstants.VERTEX_INPUT_FORMAT_CLASS.set(gconf, MultiVertexInputFormat.class);
VertexInputFormatDescription.VERTEX_INPUT_FORMAT_DESCRIPTIONS.set(gconf,
    InputFormatDescription.toJsonString(vertexInputDescriptions));
gconf.setVertexOutputFormatClass(UseCase1OutputFormat.class);
gconf.setComputationClass(UseCase1Vertex.class);
GiraphJob job = new GiraphJob(gconf, "Use Case 1");
FileOutputFormat.setOutputPath(job.getInternalJob(), new Path(arg0[1]));
return job.run(true) ? 0 : -1;

Thanks in advance.

Best,
Yasser

From: Yasser Altowim [mailto:yasser.alto...@ericsson.com]
Sent: Friday, August 16, 2013 11:36 AM
To: user@giraph.apache.org
Subject: RE: MultiVertexInputFormat

Thanks a lot Avery for your response. I am now using VertexInputFormatDescription, but still could not figure out how to set the vertex input path. I just need to read the vertex values from two different files, each with its own format. I am not using any EdgeInputFormatClass. Can you please take a look at my code below and show me how to set the vertex input path?

Thanks.

if (null == getConf()) {
  conf = new Configuration();
}
GiraphConfiguration gconf = new GiraphConfiguration(getConf());
int workers = Integer.parseInt(arg0[2]);
gconf.setWorkerConfiguration(workers, workers, 100.0f);
List<VertexInputFormatDescription> vertexInputDescriptions = Lists.newArrayList();
// Input one
VertexInputFormatDescription description1 =
    new VertexInputFormatDescription(UseCase1FirstVertexInputFormat.class);
// how to set the vertex input path?
vertexInputDescriptions.add(description1);
// Input two
VertexInputFormatDescription description2 =
    new VertexInputFormatDescription(UseCase1SecondVertexInputFormat.class);
// how to set the vertex input path?
vertexInputDescriptions.add(description2);
VertexInputFormatDescription.VERTEX_INPUT_FORMAT_DESCRIPTIONS.set(gconf,
    InputFormatDescription.toJsonString(vertexInputDescriptions));
gconf.setVertexOutputFormatClass(UseCase1OutputFormat.class);
gconf.setComputationClass(UseCase1Vertex.class);
GiraphJob job = new GiraphJob(gconf, "Use Case 1");
FileOutputFormat.setOutputPath(job.getInternalJob(), new Path(arg0[1]));
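Maja's suggestion above - give each description its own parameters, then read them back inside the input format - can be sketched with plain Java maps standing in for Giraph's VertexInputFormatDescription and Configuration. The parameter key "myApplication.vertexInputPath" is the user-chosen key from her answer, not a built-in Giraph option.

```java
// Model of per-input-format parameters: each description carries its own
// key/value map, so two instances of different input formats can each be
// pointed at their own file.
import java.util.HashMap;
import java.util.Map;

public class InputDescriptionModel {
    final String inputFormatClass;
    private final Map<String, String> params = new HashMap<>();

    InputDescriptionModel(String inputFormatClass) {
        this.inputFormatClass = inputFormatClass;
    }

    /** Mirrors VertexInputFormatDescription.addParameter(key, value). */
    public void addParameter(String key, String value) {
        params.put(key, value);
    }

    /** What the input format does on its side: read its own path back out. */
    public String vertexInputPath() {
        return params.get("myApplication.vertexInputPath");
    }

    public static void main(String[] args) {
        InputDescriptionModel d1 = new InputDescriptionModel("UseCase1FirstVertexInputFormat");
        d1.addParameter("myApplication.vertexInputPath", "file1.txt");
        InputDescriptionModel d2 = new InputDescriptionModel("UseCase1SecondVertexInputFormat");
        d2.addParameter("myApplication.vertexInputPath", "file2.txt");
        System.out.println(d1.vertexInputPath() + " " + d2.vertexInputPath());
    }
}
```

The key point is that the path is NOT set globally on the job: each description owns its parameters, so the same key can map to a different file per input format.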
Re: Multiple Data Sources
Hi Tom,

We recently added something like this; please take a look at MultiVertexInputFormat. It can basically wrap any number of vertex input formats, coming from any sources. You can also take a look at HiveGiraphRunner to see how it's used there. As for multiple vertex types, we don't support that directly, but you can have a variable describing the vertex type inside your vertex value.

Hope this helps, please let us know if you have any questions!

Maja

From: Tom M thnyanmthn...@gmail.com
Reply-To: user@giraph.apache.org
Date: Monday, July 15, 2013 9:54 AM
To: user@giraph.apache.org
Subject: Multiple Data Sources

Hi,

I am new to Giraph. I am working on implementing a graph algorithm that first reads vertex values from multiple sources (HDFS, MySQL). So basically, I would have two types of vertices, and the values of each vertex type can be read from a different data source. I know that, in MR, we can use DBInputFormat to retrieve tuples from an RDBMS, for example, and then join them with data read from HDFS. My question: can we do that in Giraph? i.e. can the graph be constructed from different data sources?

Thanks a lot in advance.

Best,
Tom
Re: Regarding multiple values of a vertex
Hi Harsh,

The other thing you can do at the moment is make another implementation of Partition (similar to SimplePartition) which does something different when a duplicate vertex is encountered, and then set giraph.partitionClass to your Partition.

Maja

From: Alessandro Presta alessan...@fb.com
Reply-To: user@giraph.apache.org
Date: Tuesday, July 9, 2013 10:57 AM
To: user@giraph.apache.org
Subject: Re: Regarding multiple values of a vertex

Hi Harsh,

It's currently not possible to combine multiple vertex values, but it is on our roadmap. For now, you could try using MapReduce to aggregate those values before you feed them to the Giraph job.

Alessandro

From: Harsh Rathi harsh.c...@gmail.com
Reply-To: user@giraph.apache.org
Date: Tuesday, July 9, 2013 12:24 AM
To: user@giraph.apache.org
Subject: Regarding multiple values of a vertex

Hi All,

I am taking the input graph in the form of 2 separate files: an edge list and a vertex list. In the vertex-list file, a vertex can have multiple values (the value of a vertex is of Text format), i.e. there can be multiple vertex-value entries for the same vertex. While taking input of a vertex, Giraph checks whether the vertex is already present in the graph, and if so it replaces the old value with the new value. I want to append all the values for the same vertex (String format) instead. I can do it by changing giraph-core's source code, but I am looking for a solution in which, while taking input using the vertex input class, it is possible to retrieve the old value of that vertex. Is it possible to do what I am proposing? Can I retrieve the value of a vertex using its vertex id in the vertex input class?

Thanks
Harsh Rathi
IIT Delhi
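The behavior Maja suggests overriding can be sketched with a plain map standing in for a Partition: the default replaces an existing vertex's value when a duplicate id arrives, while a custom putVertex could append instead. This is a model of the idea, not Giraph's Partition API; the class and method names mirror the discussion but the implementation is invented.

```java
// A partition model whose putVertex appends Text-style values on duplicate
// vertex ids instead of replacing them (the replace behavior is what
// SimplePartition does by default, per the thread above).
import java.util.HashMap;
import java.util.Map;

public class AppendingPartitionModel {
    private final Map<Long, String> vertices = new HashMap<>();

    /** On a duplicate id, append the new value rather than overwrite. */
    public void putVertex(long id, String value) {
        vertices.merge(id, value, (oldValue, newValue) -> oldValue + "," + newValue);
    }

    public String getValue(long id) {
        return vertices.get(id);
    }

    public static void main(String[] args) {
        AppendingPartitionModel partition = new AppendingPartitionModel();
        partition.putVertex(1L, "a");
        partition.putVertex(1L, "b");   // duplicate entry in the vertex-list file
        partition.putVertex(2L, "c");
        System.out.println(partition.getValue(1L));
    }
}
```

In real Giraph the custom class would extend a Partition implementation, override its putVertex to merge values, and be enabled via giraph.partitionClass as Maja describes.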
Re: Are new vertices active?
Hi Christian, As javadoc for getTotalNumVertices() says, it returns the number of vertices which existed in previous superstep, so newly created vertices are not going to be counted there. In the code mutations are applied before the next superstep starts. The way it's currently implemented, vertices created during last superstep won't exist during output. That being said, I don't know if we wanted it that way, or it just turned out like that since nobody thought about that case. Maja From: Christian Krause m...@ckrause.orgmailto:m...@ckrause.org Reply-To: user@giraph.apache.orgmailto:user@giraph.apache.org user@giraph.apache.orgmailto:user@giraph.apache.org Date: Thursday, June 27, 2013 4:59 AM To: user@giraph.apache.orgmailto:user@giraph.apache.org user@giraph.apache.orgmailto:user@giraph.apache.org Subject: Re: Are new vertices active? Thank you, Claudio. Regarding the last point: I am mutating the graph in superstep N, and in N+1 I am logging the total number of vertices and halt all nodes. When I am doing it like this, I don't get the updated number of vertices. However, if I wait one more superstep, I get the correct number. Strange.. Cheers, Christian 2013/6/26 Claudio Martella claudio.marte...@gmail.commailto:claudio.marte...@gmail.com Hi, inline are my (tentative) answers. On Wed, Jun 26, 2013 at 6:34 PM, Christian Krause m...@ckrause.orgmailto:m...@ckrause.org wrote: Hi, if I create new vertices, will they be executed in the next superstep? And does it make a difference whether I create them using addVertexRequest() or sendMessage()? The vertex will be active. The case of a sendMessage is intuitive, because a message wakens up a vertex. Another question: if I mutate the graph in superstep X and X is the last superstep, will the changes be executed? It is not clear to me whether the graph changes are executed during or before the next superstep. 
I'm actually not sure about our internal implementation (somebody can shed light on this), but I'd expect it to run, due to the above (presence of active vertices).

And related to the last question: if I mutate the graph in superstep X, and I call getTotalNumVertices() in the next step, can I expect the updated number of vertices, or the number of vertices before the mutation?

The mutations are applied at the end of a superstep and are visible in the following one. Hence, in s+1 you'd see the new number of vertices.

Sorry for these very basic questions, but I did not find any documentation on these details. If this is documented somewhere, it would be helpful to get a link.

Cheers, Christian

--
Claudio Martella claudio.marte...@gmail.com
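The timing discussed in this thread can be illustrated with a small, Giraph-free sketch: vertex additions requested during superstep s are buffered and only applied at the superstep boundary, so the count reported by getTotalNumVertices() changes in s+1. The class and method names below only mimic Giraph's API; this is not the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Giraph's deferred-mutation semantics: requested
// additions are buffered and only applied between supersteps.
class MutationTimingDemo {
    private final List<Long> vertices = new ArrayList<>();
    private final List<Long> pendingAdds = new ArrayList<>();

    MutationTimingDemo(long... initialIds) {
        for (long id : initialIds) {
            vertices.add(id);
        }
    }

    /** Analogous to addVertexRequest(): buffered, not applied immediately. */
    void addVertexRequest(long id) {
        pendingAdds.add(id);
    }

    /** Visible vertex count, analogous to getTotalNumVertices(). */
    long getTotalNumVertices() {
        return vertices.size();
    }

    /** Applied at the superstep boundary, as Giraph does with mutations. */
    void finishSuperstep() {
        vertices.addAll(pendingAdds);
        pendingAdds.clear();
    }
}
```

Here addVertexRequest() deliberately does not touch the visible vertex list, matching the semantics Claudio describes: the updated count appears in the superstep after the mutation.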
Re: What if the resulting graph is larger than the memory?
Hi JU,

One thing you can try is the out-of-core graph (the giraph.useOutOfCoreGraph option). I don't know what your exact use case is: is it the graph itself that is huge, or the data your application computes? In the second case, there is the 'giraph.doOutputDuringComputation' option you might want to try out. When it is turned on, during each superstep writeVertex is called immediately after compute is called for that vertex. This means you can store the data you want to write in the vertex, write it, and clear the data before going to the next vertex.

Maja

From: Han JU ju.han.fe...@gmail.com
Date: Friday, May 17, 2013 8:38 AM
To: user@giraph.apache.org
Subject: What if the resulting graph is larger than the memory?

Hi, it's me again. After a day's work I've coded a Giraph solution for my problem at hand. I gave it a run on a medium dataset and it's notably faster than other approaches. However, the goal is to process larger inputs; for example, I have a larger dataset whose result graph is about 400GB when represented in edge format in a text file. And I think the edges that the algorithm creates all reside in the cluster's memory. So does it mean that for this big dataset, I need a cluster with ~400GB of main memory? Is there any possibility to output on the go, meaning I don't need to construct the whole graph: an edge is output to HDFS immediately instead of being created in main memory and then output?

Thanks!

--
JU Han
Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel
+33 061960
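Both options mentioned above are ordinary Giraph configuration keys, so they can be passed on the command line. A hedged sketch of a GiraphRunner invocation follows: the jar name, computation and format classes, and paths are placeholders, only the two giraph.* keys come from this thread, and exact flag names may vary by Giraph version (check GiraphRunner's help output).

```shell
# Placeholder jar/classes/paths; -ca passes custom giraph.* options through
# to the job configuration.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  my.app.MyVertex \
  -vif my.app.MyVertexInputFormat -vip /user/me/input \
  -vof my.app.MyVertexOutputFormat -op /user/me/output \
  -w 20 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.doOutputDuringComputation=true
```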
Re: Broadcast of large aggregated value is slow.
Eric,

Can you please take a look at the logs of one of the workers listed (13, 34, 38, 50, 48, 52, 58, 56) and see what they are doing? The fact that a worker is waiting on aggregators can have different causes; it doesn't necessarily mean that sending aggregators is slow. It can, for example, mean that some workers finished computing before others and are now waiting for them to finish and send their data. How big are the aggregators you are using?

Thanks, Maja

From: Eric Kimbrel eric.kimb...@soteradefense.com
Date: Thursday, May 16, 2013 2:00 PM
To: user@giraph.apache.org
Subject: Re: Broadcast of large aggregated value is slow.

From the attached logs in the original post, you can see that both workers use about 4 seconds of compute time on superstep 4, but they complete superstep 4 about 10 minutes apart.

Eric Kimbrel
Software Engineer I, Data Fusion Analytics
Sotera Defense Solutions, Inc.
o: 360-516-6621 c: 360-990-1873
e: eric.kimb...@soteradefense.com
w: www.potomacfusion.com | www.soteradefense.com
From: Eric Kimbrel eric.kimb...@soteradefense.com
Date: Thursday, May 16, 2013 1:50 PM
To: user@giraph.apache.org
Subject: Broadcast of large aggregated value is slow.

I have a Giraph job in which the master reads a chunk of a file from HDFS and then uses an aggregator to broadcast the data to all vertices. No other messages are sent, and no vertices aggregate values; only the master does. In the attached logs you can see that the time spent broadcasting the data to all vertices is long, and the job seems to be hanging up somewhere. It appears that the majority of workers receive the data in 10-15 seconds, but then nothing happens for around 10 minutes. A log snippet is shown below. Is there a known reason why transmitting this data during synchronization takes so long, or anything that can be done to speed it up?
2013-05-16 11:09:03,041 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 30 more tasks to send their aggregator data
2013-05-16 11:09:14,444 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 10 more tasks to send their aggregator data, task ids: [13, 20, 22, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:09:25,190 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:09:45,191 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:10:05,191 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:10:15,192 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:10:35,193 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:10:55,193 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:11:05,194 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:11:25,195 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 8 more tasks to send their aggregator data, task ids: [13, 34, 38, 50, 48, 52, 58, 56]
2013-05-16 11:11:45,196 INFO
Re: Broadcast of large aggregated value is slow.
Thanks, Eric. I'm not sure what's going on; it's strange that a couple of machines wait for aggregators for a very long time and then receive them at exactly the same moment. Can you send us the code for the Aggregator you are using? Do you know approximately how big the aggregators are and how many of them you have?

Maja

From: Eric Kimbrel eric.kimb...@soteradefense.com
Date: Thursday, May 16, 2013 2:36 PM
To: user@giraph.apache.org
Subject: Re: Broadcast of large aggregated value is slow.

My apologies. You are correct, I attached the wrong log. The correct one is attached here.

Eric Kimbrel
Sotera Defense Solutions, Inc.
e: eric.kimb...@soteradefense.com
w: www.potomacfusion.com | www.soteradefense.com
From: Maja Kabiljo majakabi...@fb.com
Date: Thursday, May 16, 2013 2:25 PM
To: user@giraph.apache.org
Subject: Re: Broadcast of large aggregated value is slow.

Eric, can you please check again: in both logs you attached we are waiting on worker 13 to send data, so neither of them can be worker 13's log.

Maja

From: Eric Kimbrel eric.kimb...@soteradefense.com
Date: Thursday, May 16, 2013 2:15 PM
To: user@giraph.apache.org
Subject: Re: Broadcast of large aggregated value is slow.

One of the attached logs is worker 13. During this time period it is waiting for an aggregator request so that it can start the superstep.

Eric Kimbrel
Sotera Defense Solutions, Inc.
e: eric.kimb...@soteradefense.com
w: www.potomacfusion.com | www.soteradefense.com
Re: Custom halt condition
Hi Nicolas,

You are right, using aggregators and master compute is the way to go. Please take a look at https://cwiki.apache.org/confluence/display/GIRAPH/Aggregators to learn more about aggregators. From MasterCompute.compute() you will call haltComputation() when you decide it's time to do so. Please let me know if you have any questions.

Maja

On 3/29/13 3:16 AM, Nicolas Lalevée nicolas.lale...@hibnet.org wrote:

Hi, in my use case (an implementation of affinity propagation) I want to halt the computation once at least a minimum number of vertices have voted to halt. As far as I understand, the default is to halt when all vertices have voted to halt and no messages are sent between vertices. But in my use case, even if a vertex has voted to halt, it must still send and receive messages in case there is a next superstep. And with some of my data, some vertices take a lot of supersteps to converge and vote to halt, which I don't care much about if they are a small percentage. My current implementation creates a fake master vertex which gathers the convergence of all vertices via messages. Once that master decides it is time to halt the computation, it sends a message to all vertices so they all halt. But I have seen some threads here about master compute, and some code about aggregators, so I guess there is a smarter way of implementing this?

Nicolas
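The rule Nicolas wants (stop once a minimum fraction of vertices has voted to halt) reduces to a threshold check that MasterCompute.compute() would evaluate against an aggregated halted-vertex count before calling haltComputation(). Below is a minimal, Giraph-free sketch of just that decision; the class and method names are illustrative, not part of Giraph's API.

```java
// Hypothetical helper for the halt rule: halt once at least
// minHaltedFraction of all vertices has converged.
class HaltDecision {
    private final double minHaltedFraction;

    HaltDecision(double minHaltedFraction) {
        this.minHaltedFraction = minHaltedFraction;
    }

    /** True when enough vertices have converged to stop the computation. */
    boolean shouldHalt(long haltedVertices, long totalVertices) {
        if (totalVertices == 0) {
            return false; // nothing to decide on yet
        }
        return (double) haltedVertices / totalVertices >= minHaltedFraction;
    }
}
```

In a real job, haltedVertices would come from something like a sum aggregator that each converged vertex increments, and totalVertices from getTotalNumVertices(); the master would call haltComputation() when shouldHalt() returns true.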
Re: Waiting for times required to be 19 (currently 18)
Nate,

Are all the workers waiting for a request from the same worker? (In the log, 'waitSomeRequests: Waiting for request' with its destTask is what you should look at.) If so, check whether there is some exception on that worker. You can also try decreasing giraph.maxRequestMilliseconds and see what happens after the request gets resent. Please let us know what you find out!

Maja

From: Nate touring_...@msn.com
Date: Thursday, February 21, 2013 11:16 AM
To: user@giraph.apache.org
Subject: RE: Waiting for times required to be 19 (currently 18)

Hello Maja, thank you for your reply and the link to the issue. I last updated the code this week, and do in fact have that issue checked out in my local copy of the source. My compiled jar file of giraph-core is dated Feb 18th (three days ago). I will do another update from Git very soon, build and test again to be sure that the fix is in place, and report back if the behavior changes. Thank you, Nate

From: majakabi...@fb.com
To: user@giraph.apache.org
Subject: Re: Waiting for times required to be 19 (currently 18)
Date: Thu, 21 Feb 2013 17:48:24 +

Hi Nate, when did you take the new Giraph code? Please check whether you have the GIRAPH-506 patch in; if not, that's probably the reason for the issue.
Maja

From: Nate touring_...@msn.com
Date: Thursday, February 21, 2013 8:06 AM
To: user@giraph.apache.org
Subject: Waiting for times required to be 19 (currently 18)

I recently upgraded older Giraph code built against CDH3 to a git checkout from a few days ago that builds against the CDH4.1.0 (MRv1) libraries. All of the Giraph tests pass. When running my Giraph job with 20 workers, I usually get the above error in 19 map processes:

org.apache.giraph.utils.ExpectedBarrier: waitForRequiredPermits: Waiting for times required to be 19 (currently 18)

One map worker always shows something like:

org.apache.giraph.comm.netty.NettyClient: waitSomeRequests: Waiting interval of 15000 msecs, 1 open requests, waiting for it to be <= 0, and some metrics
org.apache.giraph.comm.netty.NettyClient: waitSomeRequests: Waiting for request (destTask=17, reqId=5032) - (reqId=5326, destAddr=host1:30017, elapsedNanos=..., started=..., writeDone=true, writeSuccess=true)
(repeats)

I say this happens "usually" because the same Giraph job does complete, but only rarely. I have a timeout of 100 minutes set, and the job is killed after that much time has elapsed. Also, the started field in the above output in this past run reads: Wed Jan 21 14:21:31 EST 1970. All machines are synchronized by a single time server and currently read accurate times; I don't think it affected the execution, but it still seems erroneous. I also don't see the Hadoop maps having status messages set on them. I see the GraphMapper giving the Context object to the GraphTaskManager instance, and I can see it calling context.setStatus(...), but those messages never show up in the map status column in the job tracker page. Is there something I've missed while upgrading the old code?
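Maja's suggestion to decrease giraph.maxRequestMilliseconds is an ordinary configuration key, so it can be passed when launching the job. The fragment below is only an illustration: the jar, class, and remaining arguments are placeholders, and the value shown is a guess (per the follow-up in this thread, the default resend interval is 10 minutes, i.e. 600000 ms).

```shell
# Hypothetical fragment: resend stuck requests after 1 minute instead of
# the default 10 minutes, to see what happens after the resend.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  my.app.MyVertex \
  -ca giraph.maxRequestMilliseconds=60000
```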
Re: Waiting for times required to be 19 (currently 18)
Nate,

Great, glad to hear it works! We resend open requests after 10 minutes, so that's why you were seeing supersteps take that long. Have fun with Giraph and let us know if you have any other questions.

Maja

From: Nate touring_...@msn.com
Date: Thursday, February 21, 2013 1:32 PM
To: user@giraph.apache.org
Subject: RE: Waiting for times required to be 19 (currently 18)

Maja, success! I did check and see that the giraph jar being used was dated 6-Feb, many hours before your fix made it into the source tree. I probably forgot to put the new jar that I made earlier this week into the right place. How frustrating. I recompiled the very latest code, put the jar into the right place, and have been able to execute the Giraph job multiple times successfully. It even executes much faster than before, and the time to execute is reliable too. It used to vary between 10 and 20 minutes when Giraph was able to complete, but now takes between 70 and 80 seconds every time without any problems. Many thanks for fixing the original issue, and for replying to my email to the list.

Nate
Re: Where can I find a simple Hello World example for Giraph
Hi Ryan,

Before running the job, you need to set the Vertex and input/output format classes on it. Please take a look at one of the benchmarks to see how to do that. Alternatively, you can try using GiraphRunner, where you pass these classes as command line arguments.

Maja

On 2/21/13 2:43 PM, Ryan Compton compton.r...@gmail.com wrote:

I'm still struggling with this. I am trying to use 0.2; I don't have permissions to edit core-site.xml. I think this is the most basic boilerplate code for a 0.2 Giraph project, but I still can't run it:

Exception in thread main java.lang.NullPointerException
    at org.apache.giraph.utils.ReflectionUtils.getTypeArguments(ReflectionUtils.java:85)
    at org.apache.giraph.conf.GiraphClasses.readFromConf(GiraphClasses.java:117)
    at org.apache.giraph.conf.GiraphClasses.<init>(GiraphClasses.java:105)
    at org.apache.giraph.conf.ImmutableClassesGiraphConfiguration.<init>(ImmutableClassesGiraphConfiguration.java:84)
    at com.hrl.issl.osi.networks.HelloGiraph0p2.setConf(HelloGiraph0p2.java:34)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:61)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.hrl.issl.osi.networks.HelloGiraph0p2.main(HelloGiraph0p2.java:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

package networks;

import java.io.IOException;
import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
import org.apache.giraph.graph.GiraphJob;
import org.apache.giraph.vertex.EdgeListVertex;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

/** Hello world giraph 0.2... */
public class HelloGiraph0p2 extends EdgeListVertex<LongWritable, Text, Text, Text> implements Tool {
    /** Configuration */
    private ImmutableClassesGiraphConfiguration<LongWritable, Text, Text, Text> conf;
    /** Class logger */
    private static final Logger LOG = Logger.getLogger(HelloGiraph0p2.class);

    @Override
    public void compute(Iterable<Text> arg0) throws IOException {
        int four = 2 + 2;
    }

    @Override
    public void setConf(Configuration configurationIn) {
        this.conf = new ImmutableClassesGiraphConfiguration<LongWritable, Text, Text, Text>(configurationIn);
    }

    @Override
    public ImmutableClassesGiraphConfiguration<LongWritable, Text, Text, Text> getConf() {
        return conf;
    }

    /** ToolRunner run */
    @Override
    public int run(String[] arg0) throws Exception {
        GiraphJob job = new GiraphJob(getConf(), getClass().getName());
        return job.run(true) ? 0 : -1;
    }

    /** main... */
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new HelloGiraph0p2(), args));
    }
}

On Tue, Feb 5, 2013 at 4:24 AM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote:

Hi Ryan: I got that same error and discovered that I had to start a Zookeeper instance. What I did was download Zookeeper and write a new zoo.cfg file under the conf directory with the following:

dataDir=/home/user/zookeeper-3.4.5/tmp
clientPort=2181

I also added some lines to Hadoop's core-site.xml:

<property>
  <name>giraph.zkList</name>
  <value>localhost:2181</value>
</property>

Then I start Zookeeper with bin/zkServer.sh start (you will also have to restart Hadoop), and then you can launch your Giraph job. This setup worked for me (maybe there is an easier way :D); hope it is useful.

Best regards, Gustavo

On Mon, Feb 4, 2013 at 10:06 PM, Ryan Compton compton.r...@gmail.com wrote:

Ok great, thanks.
I've been working with 0.1. I can get things to compile (see code below) but they still are not running; the maps hang (also below). I have no idea how to fix it. I may consider updating the code I have that compiles to 0.2 and see if it works then. The only difference I can see is that 0.2 requires everything to have a message.

-bash-3.2$ hadoop jar target/giraph-0.1-jar-with-dependencies.jar com.SimpleGiraphSumEdgeWeights /user/rfcompton/giraphTSPInput /user/rfcompton/giraphTSPOutput 3 3
13/02/04 15:48:23 INFO mapred.JobClient: Running job: job_201301230932_1199
13/02/04 15:48:24 INFO mapred.JobClient: map 0% reduce 0%
13/02/04 15:48:35 INFO mapred.JobClient: map 25% reduce 0%
13/02/04 15:58:40 INFO mapred.JobClient: Task Id : attempt_201301230932_1199_m_03_0, Status : FAILED
java.lang.IllegalStateException: run: Caught an unrecoverable exception setup: Offlining servers due to exception...
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:641)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at
Re: InputFormat for the example SimpleMasterComputeVertex
I wrote up some basic info about aggregators: https://cwiki.apache.org/confluence/display/GIRAPH/Aggregators. Please take a look, and let me know if something needs to be changed or improved.

From: Eli Reisman apache.mail...@gmail.com
Date: Thursday, February 21, 2013 4:21 PM
To: user@giraph.apache.org
Subject: Re: InputFormat for the example SimpleMasterComputeVertex

That sounds great to me; maybe just a mention in the wiki that the two functionalities are tied together will help the idea click for people. Either way, this will be a big help, I think.

On Thu, Feb 21, 2013 at 3:24 PM, Maja Kabiljo majakabi...@fb.com wrote:

Eli, that's an interesting idea; we could have some class which the user extends and which is there only for aggregator registration. Sometimes we want to register aggregators later on during the computation, so we need to keep allowing registration from masterCompute too. But I think for users the biggest problem is realizing that they have to extend and set MasterCompute/this new class in order to use aggregators. Currently, if a user tries to aggregate a value to an unregistered aggregator he will get an exception, but if he tries to get the value of an unregistered aggregator he will just get null. So maybe adding a warning message in that case, together with a wiki page, might be enough? What do you think?
From: Eli Reisman apache.mail...@gmail.com
Date: Thursday, February 21, 2013 10:25 AM
To: user@giraph.apache.org
Subject: Re: InputFormat for the example SimpleMasterComputeVertex

Thanks for the explanation, that makes sense. I would love to see a wiki page at some point: you have so much knowledge of this piece of Giraph from all your dev work on it, plus the additional bonus of experience running big cluster jobs using these features, so you have a lot of insight to share. Would there be any point to a future JIRA to break out the aggregator registration from the master compute stuff, at least from the user's view? Or is it not that confusing once you've used them a bit?

On Thu, Feb 14, 2013 at 4:52 PM, Maja Kabiljo majakabi...@fb.com wrote:

The Progressable exception can be caused by many different reasons (it's totally unrelated to aggregators), and by looking at which exception caused it, users should get a better sense of what's going on. What you are suggesting about providing a default master compute is not doable, since the part which needs to be done there is aggregator registration: we can't know what kind of aggregators (names and types) an application needs. I remember I was talking about writing a short tutorial for aggregators a long time ago; sorry for not doing that, I will try to get to it soon.
From: Eli Reisman apache.mail...@gmail.com
Date: Thursday, February 14, 2013 2:23 PM
To: user@giraph.apache.org
Subject: Re: InputFormat for the example SimpleMasterComputeVertex

Other folks on the list are also having this problem with the Progressable util exception job failures. I don't know much about master compute usage, but if it is needed to make the aggregators work, maybe we should have a default dummy class that just handles aggregators if no other master compute is specified? Or a wiki page? The Progressable error message does not lead us to this conclusion directly.

On Wed, Feb 13, 2013 at 3:04 AM, Maria Stylianou mars...@gmail.com wrote:

Hey, I am trying to run the example SimpleMasterComputeVertex, but no matter which input format and graph I give, it doesn't work. Each worker gives the error:

Caused by: java.lang.NullPointerException
    at org.apache.giraph.examples.SimpleMasterComputeVertex.compute(SimpleMasterComputeVertex.java:42)

This line 42 is the first line of compute():

public void compute(Iterable<DoubleWritable> messages) {

So I guess the initialization is not done correctly, because the input file does not have the correct format. Any help would be appreciated. Thanks!

Maria

--
Maria Stylianou
Intern at Telefonica, Barcelona, Spain
Master Student of European Master in Distributed Computing (http://www.kth.se/en/studies/programmes/master/em/emdc)
Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
KTH Royal Institute of Technology
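The aggregator contract the wiki page above documents (a commutative, associative combine of values supplied by vertices each superstep, read back in the next superstep or by the master) can be sketched without Giraph as follows. The names mirror Giraph's LongSumAggregator, but this is an illustrative stand-in, not Giraph's Aggregator interface.

```java
// Giraph-free sketch of a sum aggregator: vertices call aggregate() during
// a superstep, the combined value is read later, and the framework handles
// registration (from MasterCompute) and resetting between supersteps.
class LongSumAggregatorSketch {
    private long sum = 0;

    /** Called once per partial value; must be commutative and associative. */
    void aggregate(long partial) {
        sum += partial;
    }

    /** The combined value, as a vertex or the master would read it. */
    long getAggregatedValue() {
        return sum;
    }

    /** Reset between supersteps (for non-persistent aggregators). */
    void reset() {
        sum = 0;
    }
}
```

The commutative/associative requirement is what lets Giraph combine partial values from workers in any order; the sketch's aggregate() satisfies it trivially because addition does.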
Re: InputFormat for the example SimpleMasterComputeVertex
A Progressable exception can be caused by many different reasons (it's totally unrelated to aggregators), and by looking at which exception caused it, users should get a better sense of what's going on. What you are suggesting about providing a default master compute is not doable, since the part which needs to be done there is aggregator registration: we can't know what kinds of aggregators (names and types) an application needs. I remember talking about writing a short tutorial for aggregators a long time ago; sorry for not doing that, I will try to get to it soon.

From: Eli Reisman <apache.mail...@gmail.com>
Reply-To: user@giraph.apache.org
Date: Thursday, February 14, 2013 2:23 PM
To: user@giraph.apache.org
Subject: Re: InputFormat for the example SimpleMasterComputeVertex

Other folks on the list are also having this problem with the Progressable-related job failures. I don't know much about master compute usage, but if it is needed to make the aggregators work, maybe we should have a default dummy class that just handles aggregators when no other master compute is specified? Or a wiki page? The Progressable error message does not lead us to this conclusion directly.

On Wed, Feb 13, 2013 at 3:04 AM, Maria Stylianou <mars...@gmail.com> wrote:
Hey, I am trying to run the example SimpleMasterComputeVertex, but no matter which input format and graph I give, it doesn't work.
Each worker gives the error:

Caused by: java.lang.NullPointerException
    at org.apache.giraph.examples.SimpleMasterComputeVertex.compute(SimpleMasterComputeVertex.java:42)

Line 42 is the first line of compute():

public void compute(Iterable<DoubleWritable> messages) {

So I guess the initialization is not done correctly, because the input file does not have the correct format. Any help would be appreciated. Thanks!
Maria
--
Maria Stylianou
Intern at Telefonica, Barcelona, Spain
Master Student of the European Master in Distributed Computing (http://www.kth.se/en/studies/programmes/master/em/emdc)
Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
KTH Royal Institute of Technology, Stockholm, Sweden
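For intuition, here is a toy model (not the Giraph API) of one plausible way the NullPointerException arises when the companion MasterCompute never registers the aggregator the vertex reads: the lookup returns null and auto-unboxing it into a primitive double throws. The aggregator name "smc.sum" and all class/method names below are illustrative stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Toy stand-in for Giraph's per-job aggregator table.
    static final Map<String, Double> AGGREGATORS = new HashMap<>();

    // Toy stand-in for getAggregatedValue(): null if the name was never registered.
    static Double getAggregatedValue(String name) {
        return AGGREGATORS.get(name);
    }

    public static void main(String[] args) {
        // Without a MasterCompute registering "smc.sum", the lookup yields null,
        // and unboxing null into a primitive throws NullPointerException:
        try {
            double value = getAggregatedValue("smc.sum"); // auto-unboxes null
            System.out.println(value);
        } catch (NullPointerException e) {
            System.out.println("NPE: aggregator was never registered");
        }
        // After "registration" (roughly what MasterCompute.initialize() does
        // in Giraph), the same lookup succeeds:
        AGGREGATORS.put("smc.sum", 0.0);
        System.out.println(getAggregatedValue("smc.sum")); // 0.0
    }
}
```

This is only a sketch of the failure mode Maja's reply hints at: the fix is to configure a MasterCompute that registers the aggregators the computation uses, not to change the input format.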
Re: Can Giraph handle graphs with very large number of edges per vertex?
Hi Jeyendran,

As Paolo mentioned, there were two patches to deal with out-of-core:
GIRAPH-249 for the out-of-core graph
GIRAPH-45 for out-of-core messages

For the graph part, the current assumption is that you have enough memory to keep at least one whole partition in memory at a time. The options you need to set here are:
giraph.useOutOfCoreGraph=true
giraph.maxPartitionsInMemory= as many as you can keep

For messages, it's not necessary that the messages for a whole partition fit in memory, since they are streamed on a per-vertex basis. There is, however, the constraint that all vertex ids (from all partitions) need to fit in memory, but I understand that's not an issue for your application. Options:
giraph.useOutOfCoreMessages=true
giraph.maxMessagesInMemory= as many as you can keep

Also for messages, if you have a really heavy load and still run out of memory, you can try the options from GIRAPH-287: in practice it happens that messages are created much faster than they are actually transferred and processed on the destination, and those options prevent that from happening. But try it without these options first, since they can really slow down your application. You set:
giraph.waitForRequestsConfirmation=true
giraph.maxNumberOfOpenRequests= as many as you want

Hope this helps; let us know if you have any other questions.
Maja

On 9/13/12 8:41 AM, Paolo Castagna <castagna.li...@gmail.com> wrote:
Hi Jeyendran, interesting questions, and IMHO it is not always easy to understand how many Giraph workers are necessary in order to process a specific (large) graph. A few more comments inline, but I am interested in the answers to your questions as well.

On 13 September 2012 07:03, Jeyendran Balakrishnan <j...@personaltube.com> wrote:
After reading both of your replies, I have some (final!)
questions regarding memory usage:

· For applications with a large number of edges per vertex: are there any built-in vertex or helper classes, or at least sample code, which feature spilling edges to disk, or some kind of disk-backed map of edges, to support such vertices? Or do we have to roll our own?

You'll probably need to roll your own (let's see what others suggest). However, if you do that, you should do it in the open so others can have a look, possibly help you, and perhaps ensure that what you do might in future be contributed back to Giraph for others to benefit from and use. A few months ago I had a look at this and tried to use TDB (i.e. the storage layer available in Apache Jena) to store (and spill to disk) vertexes with Giraph. TDB uses B+Trees and memory-mapped files. It's designed and tuned to store RDF, but it is not limited to RDF, and someone might reuse its low-level indexing capabilities to store different graphs. Even if you do not use TDB, having a look at its sources might inspire you or give you some ideas about what you could do:
https://svn.apache.org/repos/asf/jena/trunk/jena-tdb/src/main/java/com/hp/hpl/jena/tdb/index/

· For graphs with a large number of vertices relative to available workers, at least in the development phase, one may not always have access to a large number of workers, yet one might wish to process a very large graph. In these cases, the workers may not be able to hold all their assigned vertices in memory. So again, are there any built-in classes to allow spilling vertices to disk, or a similar kind of disk-backed map?

Here I am not sure I understand where your need comes from. I usually develop and test everything locally, but while I do that I use a small dataset which can be loaded in memory and allows me to iterate faster. Why do you need to use a large/real dataset in the development phase? How large is your large number of vertices?
Even if you use indexes and data structures on disk, as your dataset grows the indexing and processing might take a long time. So, perhaps, in development you are better off with small datasets anyway.

· Assuming some kind of disk backing is implemented to handle a large number of vertices/edges (under a situation of insufficient workers or memory per worker), is it likely that just the volume of IO (message/IPC) could cause OOMEs? Or merely slowdowns?

There was work on spilling messages to disk, and I found GIRAPH-249 (marked as resolved): https://issues.apache.org/jira/browse/GIRAPH-249

In general, I feel that one of the reasons for the wide and rapid adoption of Hadoop is the "download, install and run" feature, where even for large datasets the stock code will still run to completion on a single laptop (or a single Linux server, etc.), except that it will take more time. But this may be perfectly acceptable for people who are evaluating and experimenting, since there is no incurred cost for hardware. A lot of developers might be OK with giving the thing a run overnight on their laptops, or firing up just one spot instance on EC2, and letting it chug
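For reference, the giraph.* options Maja lists above are passed to a job as custom arguments. A hedged sketch of what the invocation could look like with GiraphRunner's -ca flags; the jar path, computation class, input format, paths, and worker count are placeholders, and flag names should be checked against your Giraph build:

```shell
# Sketch only: enable out-of-core graph and messages via -ca custom arguments.
# All paths and class names below are illustrative, not from the thread.
hadoop jar giraph-examples-with-dependencies.jar org.apache.giraph.GiraphRunner \
  my.app.MyComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/me/input/graph \
  -op /user/me/output/result \
  -w 10 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.maxPartitionsInMemory=10 \
  -ca giraph.useOutOfCoreMessages=true \
  -ca giraph.maxMessagesInMemory=1000000
```

The GIRAPH-287 throttling options (giraph.waitForRequestsConfirmation, giraph.maxNumberOfOpenRequests) would be added the same way, but per the advice above, only after the basic out-of-core options prove insufficient.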
Re: How to register aggregators with the 'new' Giraph?
I don't plan to change the API for aggregators anymore; only the way they are implemented is going to change (unless someone else has an objection to or improvement of the current API). So I can already write the tutorial on how to use them. We should probably make some plan for the page structure on https://cwiki.apache.org/confluence/display/GIRAPH/Index, otherwise it's going to be a mess :-) So, for example, have a section on writing a simple application first, with some examples, and then a section with additional stuff, with subsections for combiners, aggregators, master compute… What do you think?

From: Eli Reisman <apache.mail...@gmail.com>
Reply-To: user@giraph.apache.org
Date: Tuesday, September 11, 2012 7:38 PM
To: user@giraph.apache.org
Subject: Re: How to register aggregators with the 'new' Giraph?

Hey Maja, a small tutorial on the wiki would be wonderful, either now or when the final changes to aggregators in the upcoming patches are done. We need a wiki entry for master compute too. I would also like to go through and update some of the website examples regarding best practices with the new Vertex API: using the bin/giraph script and command-line opts to set up jobs without writing your own run() method, implementing Tool, writing your own IO formats, etc. Thanks again!

On Tue, Sep 11, 2012 at 9:36 AM, Paolo Castagna <castagna.li...@gmail.com> wrote:
Hi Maja, yep, your explanation makes sense. Clear now. Paolo

On 11 September 2012 16:09, Maja Kabiljo <majakabi...@fb.com> wrote:
Hi Paolo, glad to hear it works :-) The reason why you don't see the value you set with setAggregatedValue right away is that we want to read aggregated values from the previous superstep and change them for the next one.
It goes the same with vertices, where you call aggregate to contribute values for the next superstep and read the values from the previous one. This is actually the part which wasn't working well before: it wasn't possible to get an aggregated value without the changes that vertices on the same worker made in the current superstep. Hope this makes it clear for you. Maja

On 9/11/12 12:45 PM, Paolo Castagna <castagna.li...@gmail.com> wrote:
Hi, the green bar is back. :-) I made multiple mistakes in relation to the new aggregators, but now I believe I've grasped how they work. For those interested, the PageRankVertex, PageRankMasterCompute and PageRankWorkerContext are here:
https://github.com/castagna/jena-grande/blob/9dd50837d6a13c542cce5d77a69cea071a91cee8/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java
https://github.com/castagna/jena-grande/blob/9dd50837d6a13c542cce5d77a69cea071a91cee8/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankMasterCompute.java
https://github.com/castagna/jena-grande/blob/9dd50837d6a13c542cce5d77a69cea071a91cee8/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankWorkerContext.java

There might be some further improvement left, but I'll try that another time. For example:

registerPersistentAggregator("dangling-current", DoubleSumAggregator.class);
registerPersistentAggregator("error-current", DoubleSumAggregator.class);

could probably be registerAggregator. I also noticed that within the compute() method, if I call setAggregatedValue(name, ...) and then getAggregatedValue(name), I don't seem to get the value back, although the value is sent to the worker. This is not important, but it confuses me. I do agree with you that the situation around aggregators is cleaner now than before. Thank you for your help. Paolo

PS: There is still a known failure in the tests; it is there to show that the SimplePageRankVertex approach is too simple: it does not give back a probability distribution (i.e.
the sum at the end is not 1.0) and it does not take dangling nodes into account properly. On the other hand, PageRankVertex produces the same results as two other implementations: one serial, all in memory, and another one using JUNG.

On 11 September 2012 11:03, Maja Kabiljo <majakabi...@fb.com> wrote:
Hi Paolo, you get null for the aggregated value because aggregators haven't been registered yet at the moment WorkerContext.preApplication() is called. But I think that shouldn't be a problem, since you can set initial values for aggregators in MasterCompute.initialize(). Please also note that you are not using the new aggregator API in the proper way: getAggregatedValue will return the value of the aggregator, not the aggregator object itself. It's not possible to set the value of aggregators on workers (in methods from WorkerContext and Vertex), because that would produce nondeterministic results. You aggregate on workers and set values
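The superstep semantics Maja describes (values aggregated during superstep S become readable only in superstep S+1) can be modeled with a tiny self-contained toy. This is not the Giraph API; the class and method names mirror it only loosely, and the "barrier" is reduced to a single method call:

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Toy model of Giraph's aggregator visibility rule: writes land in a
    // pending table and only become readable after the superstep barrier.
    static class ToyRegistry {
        Map<String, Double> visible = new HashMap<>(); // readable this superstep
        Map<String, Double> pending = new HashMap<>(); // written this superstep

        void aggregate(String name, double v) {        // called "by vertices"
            pending.merge(name, v, Double::sum);
        }
        double getAggregatedValue(String name) {       // previous superstep's value
            return visible.getOrDefault(name, 0.0);
        }
        void nextSuperstep() {                         // stand-in for the barrier
            visible = pending;
            pending = new HashMap<>();
        }
    }

    public static void main(String[] args) {
        ToyRegistry agg = new ToyRegistry();
        // Superstep 0: two "vertices" aggregate into "error"...
        agg.aggregate("error", 0.25);
        agg.aggregate("error", 0.50);
        // ...but within the same superstep the new sum is not yet visible:
        System.out.println(agg.getAggregatedValue("error")); // 0.0
        agg.nextSuperstep();
        // Superstep 1: the superstep-0 sum is now readable everywhere.
        System.out.println(agg.getAggregatedValue("error")); // 0.75
    }
}
```

This is exactly why Paolo's setAggregatedValue/getAggregatedValue round trip inside compute() appears not to "take": the write is deliberately deferred to the next superstep so that all workers observe deterministic values.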