Yes, I think after I emailed you I remembered we did talk about this briefly a few months back on the mailing list or a thread. Giraph has been coming along nicely, I agree! If you do put your solution up, let me know, I'd love to see it.
On Sun, Sep 16, 2012 at 2:24 AM, Paolo Castagna <castagna.li...@gmail.com> wrote:

> Hi Eli,
> mine was just an attempt pre-GIRAPH-249 (and my attempt was not successful because I had problems sub-classing MutableVertex at the time, if I remember correctly. I think I shared some of my issues at the time...). Anyway, now that GIRAPH-249 is closed, the need for that is less or gone. I just need some time to look at the Giraph source code now and test/use what you did for GIRAPH-249. Giraph is making such good progress and I still need to catch up. :-)
>
> If/when I do that and I see any valuable (i.e. faster) alternative, I'll share it here or implement an example and share the code on GitHub.
>
> Cheers,
> Paolo
>
> On 14 September 2012 11:03, Eli Reisman <apache.mail...@gmail.com> wrote:
> > Great discussion. I am very curious about the Apache Jena spill solution you were speaking of; I will check it out. What was your impression, was it successful for your uses? Sounds like it was not so hard to adapt for this use?
> >
> > The good news is that Giraph recently acquired the ability to optionally spill both messages and vertex data to disk to avoid overloads, and when configured right it should provide the functionality you're looking for. Even though Giraph rides atop the Hadoop framework, it performs its calculations in a fundamentally different paradigm than MapReduce, so I doubt we will ever fully replicate Hadoop's ability to trade cluster size for calculation time so transparently. Regarding the use of these new features, there are threads on the Giraph JIRA list by Maja, Claudio, and Alessandro covering these issues that I'd recommend reading. Try it out, and please let us know how it goes for you. It's exciting for us to have these features available now.
> >
> >> Many thanks for the point-by-point replies. It clarifies a lot of questions I had.
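[Editor's note: the spill-to-disk features Eli mentions are switched on through job configuration. A minimal sketch of a job submission follows; the option names (`giraph.useOutOfCoreGraph`, `giraph.useOutOfCoreMessages`, and the in-memory thresholds) and the example class names are assumptions based on the out-of-core work of that era (GIRAPH-45/GIRAPH-249) and may differ between Giraph versions; check GiraphConfiguration in your build.]

```
# Hypothetical job submission enabling out-of-core graph and messages
# via -ca (custom argument) flags; option names may vary by version,
# and org.example.* classes are placeholders for your own job classes.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  org.example.MyVertex \
  -vif org.example.MyVertexInputFormat -vip /input/graph \
  -w 4 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.maxPartitionsInMemory=10 \
  -ca giraph.useOutOfCoreMessages=true \
  -ca giraph.maxMessagesInMemory=1000000
```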
> >>
> >> The Pregel papers did throw more light on the approach and architecture.
> >>
> >> Hi Eli,
> >>
> >> Your feedback about very large scale applications on Giraph sounds very encouraging. Thanks very much.
> >>
> >> After reading both of your replies, I have some (final!) questions regarding memory usage:
> >>
> >> · For applications with a large number of edges per vertex: Are there any built-in vertex or helper classes, or at least sample code, which feature spilling of edges to disk, or some kind of disk-backed map of edges, to support such vertices? Or do we have to sort of roll our own?
> >>
> >> · For graphs with a large number of vertices relative to available workers: at least in the development phase, one may not always have access to a large number of workers, yet one might wish to process a very large graph. In these cases, it may happen that the workers are not able to hold all their assigned vertices in memory. So again, are there any built-in classes to allow spilling of vertices to disk, or a similar kind of disk-backed map?
> >>
> >> · Assuming some kind of disk backing is implemented to handle a large number of vertices/edges (under a situation of an insufficient number of workers or memory per worker), is it likely that just the volume of I/O (message/IPC) could cause OOMEs? Or merely slowdowns?
> >>
> >> In general, I feel that one of the reasons for the wide and rapid adoption of Hadoop is the "download, install and run" feature, where even for large data sets the stock code will still run to completion on a single laptop (or a single Linux server, etc.), except that it will take more time. But this may be perfectly acceptable for people who are evaluating and experimenting, since there is no incurred cost for hardware.
> >> A lot of developers might be OK with giving the thing a run overnight on their laptops, or firing up just one spot instance on EC2 etc. and letting it chug along for a couple of days.
> >>
> >> I know this was the case for me when I was starting out with Hadoop. So more nodes are needed only to speed things up, but not for functionality.
> >>
> >> It might be great to include such features in Giraph also... which will require that disk-backed workers be supported in the code as a standard feature...
> >>
> >> Would love to hear your thoughts on these...
> >>
> >> Thanks,
> >>
> >> Jeyendran
> >>
> >> From: Eli Reisman [mailto:apache.mail...@gmail.com]
> >> Sent: Tuesday, September 11, 2012 12:11 PM
> >> To: user@giraph.apache.org
> >> Subject: Re: Can Giraph handle graphs with very large number of edges per vertex?
> >>
> >> Hi Jeyendran, I was just saying the same thing about the documentation on another thread; couldn't agree more. There will be progress on this soon, I promise. I'd like us to reach a model of "if you add a new feature or change a core feature, the patch gets committed contingent on a new wiki page of docs going up on the website." There's still nothing about our new Vertex API, master compute, etc. on the wiki.
> >>
> >> I would say 8 gigs to play with is a great amount, where you will most definitely be able to get very large, interesting graphs to run in-memory, depending on how many workers (with 8G each) you have to work with. Having 3-4 workers per machine is not a bad thing if you are provisioned to do this. And lots of machines. This is a distributed batch processing framework, so more is better ;)
> >>
> >> As far as vertices with a million edges: sure, but it depends on how many of them there are and your compute resources.
> >> Again, I can't go into much detail, but Giraph has been extensively tested using real-world, large, interesting, useful graph data. This includes large social graphs that have supernodes. So if you're supplying that, and you have the gear to run your data, you've picked the right tool. You can spill to disk, run in memory, or spread the load and scale to many, many workers (Mapper tasks) hosted on many nodes, and Giraph will behave well if you have the compute resources to scale to fit your volume of data.
> >>
> >> On Tue, Sep 11, 2012 at 12:27 AM, Avery Ching <ach...@apache.org> wrote:
> >>
> >> Hi Jeyendran, nice to meet you.
> >>
> >> Answers inline.
> >>
> >> On 9/10/12 11:23 PM, Jeyendran Balakrishnan wrote:
> >>
> >> I am trying to understand what kind of data Giraph holds in memory per worker. My questions, in descending order of importance:
> >>
> >> 1. Does Giraph hold in memory exactly one vertex of data at a time, or does it need to hold all the vertices assigned to that worker?
> >>
> >> All vertices assigned to that worker.
> >>
> >> 2. Can Giraph handle vertices with a million edges per vertex?
> >>
> >> Depends on how much memory you have. I would recommend making a custom vertex implementation that has a very efficient store for better scalability (e.g. see IntIntNullIntVertex).
> >>
> >> If not, at what order of magnitude does it break down? 1,000 edges, 10K edges, 100K edges? (Of course, I understand that this depends upon the -Xmx value, so let's say we fix a value of -Xmx8g.)
> >>
> >> 3. Are there any limitations on the kind of objects that can be used as vertices?
> >> Specifically, does Giraph assume that vertices are lightweight (e.g. integer vertex ID + simple Java primitive vertex values + collection of out-edges), or can Giraph support heavyweight vertices (holding complex nested Java objects in a vertex)?
> >>
> >> The limitations are that the vertex implementation must be Writable, the vertex index must be WritableComparable, the edge type must be Writable, and the message type must be Writable.
> >>
> >> 4. More generally, what data is stored in memory, and what, if any, is offloaded/spilled to disk?
> >>
> >> Messages and vertices can be spilled to disk, but you must enable this.
> >>
> >> Would appreciate any light the experts can throw on this.
> >>
> >> On this note, I would like to mention that the presentations posted on the wiki explain what Giraph can do and how to use it from a coding perspective, but there are no explanations of the design approach used, the rationale behind the choices, and the software architecture. I feel that new users can really benefit from a design and architecture document, along the lines of Hadoop and Lucene. For folks who are considering whether or not to use Giraph, this can be a big help. The only alternative today is to read the source code, the burden of which might in itself be a reason for folks not to consider using Giraph.
> >>
> >> My 2c :-)
> >>
> >> Agreed that documentation is lacking =). That being said, the presentations explain most of the design approach and reasons. I would refer to the Pregel paper for a more detailed look, or ask if you have any specific questions.
> >>
> >> Thanks a lot,
> >>
> >> No problem!
> >>
> >> Jeyendran
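[Editor's note: Avery's advice above about a "very efficient store" (like IntIntNullIntVertex) boils down to keeping edges in primitive arrays rather than collections of boxed edge objects, so per-edge memory overhead stays near 4 bytes. This is a hypothetical, standalone sketch of that idea, not Giraph's actual class or API:]

```java
import java.util.Arrays;

// Hypothetical sketch (NOT Giraph's IntIntNullIntVertex): out-edges held
// as a growable primitive int array, avoiding per-edge object headers and
// boxed Integers. A million edges costs roughly 4 MB here.
public class CompactIntVertex {
    private final int id;
    private int value;
    private int[] targetIds;  // destination vertex ids; no edge values stored
    private int numEdges;

    public CompactIntVertex(int id, int value, int initialCapacity) {
        this.id = id;
        this.value = value;
        this.targetIds = new int[Math.max(1, initialCapacity)];
    }

    public void addEdge(int targetId) {
        if (numEdges == targetIds.length) {
            // Amortized O(1) growth, same strategy as ArrayList
            targetIds = Arrays.copyOf(targetIds, targetIds.length * 2);
        }
        targetIds[numEdges++] = targetId;
    }

    public int getId() { return id; }
    public int getValue() { return value; }
    public void setValue(int value) { this.value = value; }
    public int getNumEdges() { return numEdges; }

    public static void main(String[] args) {
        CompactIntVertex v = new CompactIntVertex(1, 0, 4);
        for (int t = 0; t < 1000000; t++) {
            v.addEdge(t);  // a "supernode" with a million out-edges
        }
        System.out.println(v.getNumEdges());  // prints 1000000
    }
}
```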
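[Editor's note: Avery's answer to question 3 says every vertex, edge, and message type must be Writable. The contract is two methods, write(DataOutput) and readFields(DataInput). The sketch below demonstrates that round-trip with plain java.io streams so it runs without the Hadoop jar; a real Giraph type would implement org.apache.hadoop.io.Writable with the same method shapes. The class and fields are made up for illustration:]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical vertex value demonstrating the Writable contract without
// the Hadoop dependency: the type serializes itself field-by-field and
// reads the fields back in the same order.
public class PageInfoValue {
    private double rank;
    private String url;

    public PageInfoValue() {}  // Writables need a no-arg constructor for deserialization
    public PageInfoValue(double rank, String url) { this.rank = rank; this.url = url; }

    public void write(DataOutput out) throws IOException {
        out.writeDouble(rank);
        out.writeUTF(url);
    }

    public void readFields(DataInput in) throws IOException {
        rank = in.readDouble();  // must mirror write() order exactly
        url = in.readUTF();
    }

    public double getRank() { return rank; }
    public String getUrl() { return url; }

    public static void main(String[] args) throws IOException {
        // Round-trip: serialize, then deserialize into a fresh instance,
        // which is what happens when a vertex is shipped between workers.
        PageInfoValue original = new PageInfoValue(0.85, "http://example.org");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        PageInfoValue copy = new PageInfoValue();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getRank() + " " + copy.getUrl());  // prints 0.85 http://example.org
    }
}
```

This same pattern answers the "heavyweight vertices" question: nested objects work as long as every field is reachable by write()/readFields(), but each extra field adds serialization and memory cost per vertex.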