Hi Eli, mine was just an attempt pre-GIRAPH-249 (and it was not successful because I had problems sub-classing MutableVertex at the time, if I remember correctly. I think I shared some of my issues back then...). Anyway, now that GIRAPH-249 is closed, the need for that is reduced or gone. I just need some time to look at the Giraph source code now and test/use what you did for GIRAPH-249. Giraph is making such good progress and I still need to catch up. :-)
If/when I do that and I see any valuable (i.e. faster) alternative, I'll share it here or implement an example and share the code on GitHub.

Cheers,
Paolo

On 14 September 2012 11:03, Eli Reisman <apache.mail...@gmail.com> wrote:

> Great discussion. I am very curious about the Apache Jena spill solution you were speaking of, will check it out. What was your impression, was it successful for your uses? Sounds like it was not so hard to adapt for this use?
>
> The good news is, Giraph recently acquired the ability to optionally spill both messages and vertex data to disk to avoid overloads, and when configured right it should provide the functionality you're looking for. Even though Giraph rides atop the Hadoop framework, it performs its calculations in a fundamentally different paradigm than MapReduce, so I doubt we will ever fully replicate Hadoop's ability to trade cluster size for calculation time so transparently. Regarding the use of these new features, there are threads on the Giraph JIRA list by Maja, Claudio, and Alessandro covering these issues that I'd recommend reading. Try it out, and please let us know how it goes for you. It's exciting for us to have these features available now.
>
>> Many thanks for the point-by-point replies. It clarifies a lot of questions I had.
>>
>> The Pregel papers did throw more light on the approach and architecture.
>>
>> Hi Eli,
>>
>> Your feedback about very large scale applications on Giraph sounds very encouraging. Thanks very much.
>>
>> After reading both of your replies, I have some (final!) questions regarding memory usage:
>>
>> · For applications with a large number of edges per vertex: are there any built-in vertex or helper classes, or at least sample code, which feature spilling of edges to disk, or some kind of disk-backed map of edges, to support such vertices? Or do we have to roll our own?
>>
>> · For graphs with a large number of vertices relative to available workers: at least in the development phase, one may not always have access to a large number of workers, yet one might wish to process a very large graph. In these cases the workers may not be able to hold all their assigned vertices in memory. So again, are there any built-in classes to allow spilling of vertices to disk, or a similar kind of disk-backed map?
>>
>> · Assuming some kind of disk backing is implemented to handle a large number of vertices/edges (under a situation of insufficient workers or memory per worker), is it likely that just the volume of IO (message/IPC) could cause OOMEs? Or merely slowdowns?
>>
>> In general, I feel that one of the reasons for the wide and rapid adoption of Hadoop is the “download, install and run” feature, where even for large data sets the stock code will still run to completion on a single laptop (or a single Linux server, etc.), except that it will take more time. This may be perfectly acceptable for people who are evaluating and experimenting, since there is no incurred cost for hardware. A lot of developers might be OK with giving the thing a run overnight on their laptops, or firing up just one spot instance on EC2 and letting it chug along for a couple of days.
>>
>> I know this was the case for me when I was starting out with Hadoop. So more nodes are needed only to speed things up, not for functionality.
>>
>> It might be great to include such features into Giraph also… which will require that disk-backed workers be supported in the code as a standard feature…
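For anyone who wants to try those spill features right away, here is a minimal sketch of enabling them programmatically. The giraph.* property names are the ones associated with the out-of-core graph (GIRAPH-249) and message-spilling work; treat them as illustrative, since they may differ between Giraph versions.

  import org.apache.hadoop.conf.Configuration;

  public class OutOfCoreSettings {
      // Illustrative only: property names may vary across Giraph versions,
      // so verify them against the configuration options in your release.
      public static void enableOutOfCore(Configuration conf) {
          // Keep only a bounded number of vertex partitions in memory and
          // spill the rest to local disk (the GIRAPH-249 feature).
          conf.setBoolean("giraph.useOutOfCoreGraph", true);
          conf.setInt("giraph.maxPartitionsInMemory", 10);

          // Spill incoming messages to local disk as well.
          conf.setBoolean("giraph.useOutOfCoreMessages", true);
          conf.setInt("giraph.maxMessagesInMemory", 1000000);
      }
  }

Since the job runner is a standard Hadoop Tool, the same properties should also be settable from the command line via Hadoop's generic -D options.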
>>
>> Would love to hear your thoughts on these…
>>
>> Thanks,
>> Jeyendran
>>
>> From: Eli Reisman [mailto:apache.mail...@gmail.com]
>> Sent: Tuesday, September 11, 2012 12:11 PM
>> To: user@giraph.apache.org
>> Subject: Re: Can Giraph handle graphs with very large number of edges per vertex?
>>
>> Hi Jeyendran, I was just saying the same thing about the documentation on another thread, couldn't agree more. There will be progress on this soon, I promise. I'd like us to reach a model of "if you add a new feature or change a core feature, the patch gets committed contingent on a new wiki page of docs going up on the website." There's still nothing about our new Vertex API, master compute, etc. on the wiki.
>>
>> I would say 8 gigs to play with is a great amount, with which you will most definitely be able to get very large, interesting graphs to run in memory, depending on how many workers (with 8G each) you have to work with. Having 3-4 workers per machine is not a bad thing if you are provisioned to do this. And lots of machines. This is a distributed batch-processing framework, so more is better ;)
>>
>> As far as vertices with a million edges: sure, but it depends on how many of them there are and on your compute resources. Again, I can't go into much detail, but Giraph has been extensively tested using real-world, large, interesting, useful graph data. This includes large social graphs that have supernodes. So if you're supplying that, and you have the gear to run your data, you've picked the right tool. You can spill to disk, run in memory, or spread the load and scale to many, many workers (Mapper tasks) hosted on many nodes, and Giraph will behave well if you have the compute resources to scale to fit your volume of data.
>>
>> On Tue, Sep 11, 2012 at 12:27 AM, Avery Ching <ach...@apache.org> wrote:
>>
>> Hi Jeyendran, nice to meet you.
>>
>> Answers inline.
>>
>> On 9/10/12 11:23 PM, Jeyendran Balakrishnan wrote:
>>
>> I am trying to understand what kind of data Giraph holds in memory per worker. My questions in descending order of importance:
>>
>> 1. Does Giraph hold in memory exactly one vertex of data at a time, or does it need to hold all the vertices assigned to that worker?
>>
>> All vertices assigned to that worker.
>>
>> 2. Can Giraph handle vertices with a million edges per vertex? If not, at what order of magnitude does it break down - 1,000 edges, 10K edges, 100K edges? (Of course, I understand that this depends upon the -Xmx value, so let's say we fix a value of -Xmx8g.)
>>
>> Depends on how much memory you have. Would recommend making a custom vertex implementation that has a very efficient store for better scalability (i.e. see IntIntNullIntVertex).
>>
>> 3. Are there any limitations on the kind of objects that can be used as vertices? Specifically, does Giraph assume that vertices are lightweight (e.g., integer vertex ID + simple Java primitive vertex values + collection of out-edges), or can Giraph support heavyweight vertices (holding complex nested Java objects in a vertex)?
>>
>> Limitations are that the vertex implementation must be Writable, the vertex index must be WritableComparable, the edge type Writable, and the message type Writable.
>>
>> 4. More generally, what data is stored in memory, and what, if any, is offloaded/spilled to disk?
>>
>> Messages and vertices can be spilled to disk, but you must enable this.
>>
>> Would appreciate any light the experts can throw on this.
>>
>> On this note, I would like to mention that the presentations posted on the wiki explain what Giraph can do and how to use it from a coding perspective, but there are no explanations of the design approach used, the rationale behind the choices, and the software architecture. I feel that new users can really benefit from a design and architecture document, along the lines of Hadoop and Lucene. For folks who are considering whether or not to use Giraph, this can be a big help. The only alternative today is to read the source code, the burden of which might in itself be reason for folks not to consider using Giraph.
>> My 2c :-)
>>
>> Agreed that documentation is lacking =). That being said, the presentations explain most of the design approach and reasons. I would refer to the Pregel paper for a more detailed look, or ask if you have any specific questions.
>>
>> Thanks a lot,
>>
>> No problem!
>>
>> Jeyendran
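To make the recommendations quoted above concrete (a Writable type backed by a compact primitive store, in the spirit of IntIntNullIntVertex), here is a minimal, hypothetical sketch. It is not a Giraph class and deliberately ignores the Vertex API, which changes between versions; it only illustrates keeping a million target ids in a primitive int array (roughly 4 MB) rather than one object per edge, which can easily cost an order of magnitude more once object headers, references, and map entries are counted.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  // Hypothetical compact edge store: target vertex ids kept in an int[].
  public class IntEdgeArray implements Writable {
      private int[] targets = new int[0];

      public void setTargets(int[] targets) {
          this.targets = targets;
      }

      public int getNumEdges() {
          return targets.length;
      }

      @Override
      public void write(DataOutput out) throws IOException {
          // Serialize as a length-prefixed run of ints.
          out.writeInt(targets.length);
          for (int t : targets) {
              out.writeInt(t);
          }
      }

      @Override
      public void readFields(DataInput in) throws IOException {
          int n = in.readInt();
          targets = new int[n];
          for (int i = 0; i < n; i++) {
              targets[i] = in.readInt();
          }
      }
  }

Wiring a store like this into an actual vertex implementation is version-specific, so the example vertices shipped with Giraph (IntIntNullIntVertex among them) are the best reference for the current API.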