Wow, thanks for sharing the nice comment and blog post!! I'll look at it soon and try to join this discussion. :)
On Sat, Jul 3, 2010 at 1:08 AM, Claudio Martella <[email protected]> wrote:
> I'll try to make things clear.
>
> The end of a superstep is reached through the deactivation of all the
> vertices. In Pregel the superstep is over when all the vertices have
> called VoteToHalt() (in Hama it's done by the sync() method). This
> happens at the end of each computation by each vertex. Each vertex is
> activated by the arrival of a message directed to it. This means that
> each superstep's computation is atomic, and that should be taken into
> account in the design of the algorithm. That's a change of paradigm,
> and it's what Pregel's authors call programming from the "vertex's
> perspective".
>
> So no, there's no assumption that all the vertices are active at the
> beginning of each superstep.
>
> About Felix's argument: yes, fewer supersteps mean less communication
> and synchronization overhead. At the same time, longer supersteps make
> it more probable that certain vertices end their computation earlier
> than others, leaving them idle for a long time (waiting for the others
> to finish) and losing computational time. So ideally there should be a
> good balance between long computational supersteps (decreasing
> communication overhead) and short computational supersteps (decreasing
> idle time).
> This is an intrinsic problem of BSP models because of the barrier. By
> contrast, dataflow models don't have barriers and each computation is
> more independent, therefore more similar to the model you have in mind.
>
> Hope this helps.
>
> I attach the text from my blog post (roughly obtained with html2text)
> as requested.
>
> Cheers,
>
> Claudio
>
>
> zercal wrote:
>> The papers I found about Pregel are not very detailed;
>> is it "http://portal.acm.org/citation.cfm?id=1582716.1582723"?
>> I guess that in this paper vertices are assumed to be all active at
>> every superstep. Simple random access will reduce some communication
>> cost but take more supersteps.
>> However, is there any vertex selection method such that at every
>> superstep each vertex knows whether to be active, according to the
>> information it has kept and received from other vertices?
>> But I couldn't find more detail in that paper...
>> Besides, I cannot access your blog. Would you please send me your
>> article?
>>
>> Thank you very much!
>> from Xiong Chenyan
>>
>> On 2010-07-02 20:04:40, "Felix Halim" <[email protected]> wrote:
>>
>>> Exactly how to activate a particular vertex is not clear from the
>>> paper (is it random access?), and this feature is probably not as
>>> good as it sounds for complex graph algorithms. It might be better
>>> to assume all vertices are active (to reduce the overhead of the
>>> flag needed and the space to make it randomly accessible, by
>>> storing it in blocks or whatever).
>>>
>>> Here is my argument:
>>>
>>> The way Pregel (and existing MR) works is iterative, where each
>>> iteration is separated by a super-step barrier at which all
>>> messages have to arrive. Algorithms that have fewer super-steps are
>>> preferable to those that have a large number of super-steps. In
>>> fact, we should measure algorithms in terms of the number of
>>> super-steps required. To minimize the number of super-steps, we
>>> likely need to activate as many vertices as possible to do all the
>>> work in the current super-step, rather than spill over to the next
>>> super-step. In this case, the feature to "turn off" vertices is
>>> useless, since most of the time all vertices will be active to
>>> effectively reduce the number of super-steps.
>>>
>>> Unfortunately, I don't have experiments to back up my argument... I
>>> don't have Pregel...
>>>
>>> Felix Halim
>>>
>>>
>>> On Fri, Jul 2, 2010 at 7:37 PM, Claudio Martella
>>> <[email protected]> wrote:
>>>
>>>> I did too. See:
>>>>
>>>> http://blog.acaro.org/entry/pregel-is-out-but-what-is-pregel
>>>>
>>>>
>>>> Felix Halim wrote:
>>>>
>>>>> I have.
>>>>> See my comment on this blog:
>>>>>
>>>>> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>>>>>
>>>>> Felix Halim
>>>>>
>>>>>
>>>>> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> has anybody read it?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>> --
>>>> Claudio Martella
>>>> Digital Technologies
>>>> Unit Research & Development - Analyst
>>>>
>>>> TIS innovation park
>>>> Via Siemens 19 | Siemensstr. 19
>>>> 39100 Bolzano | 39100 Bozen
>>>> Tel. +39 0471 068 123
>>>> Fax +39 0471 068 129
>>>> [email protected] http://www.tis.bz.it
>>>>
>>>> Short information regarding use of personal data. According to
>>>> Section 13 of Italian Legislative Decree no. 196 of 30 June 2003,
>>>> we inform you that we process your personal data in order to fulfil
>>>> contractual and fiscal obligations and also to send you information
>>>> regarding our services and events. Your personal data are processed
>>>> with and without electronic means and by respecting data subjects'
>>>> rights, fundamental freedoms and dignity, particularly with regard
>>>> to confidentiality, personal identity and the right to personal
>>>> data protection. At any time and without formalities you can write
>>>> an e-mail to [email protected] in order to object to the processing
>>>> of your personal data for the purpose of sending advertising
>>>> materials and also to exercise the right to access personal data
>>>> and other rights referred to in Section 7 of Decree 196/2003. The
>>>> data controller is TIS Techno Innovation Alto Adige, Siemens Street
>>>> n. 19, Bolzano. You can find the complete information on the web
>>>> site www.tis.bz.it.
>
> claudio martella
> Google Pregel is out. But what is Pregel?
> June 21, 2010
> Google Pregel's paper is finally out. But what is Pregel? It has been
> mentioned in many posts talking about NoSQL, GraphDBs, Big Data, even
> the Facebook OpenGraph, so it looks like, apart from the hype, there's
> a little confusion about what it is, what it does, what it's good for
> and what it certainly is not.
> Let's start with the latter.
> Google Pregel is not a database (neither RDBMS nor NoSQL), not a
> key-value store, nor any new means of storing data (big or small as it
> might be). Putting it in the same lists as GraphDBs like Neo4j,
> HyperGraphDB or even Twitter's FlockDB is somewhat like putting
> MapReduce in the NoSQL group.
> GraphDBs are storage systems that use graph representations for data,
> where each node represents an entity with a unique id, a type and
> properties. An arc represents a relationship between two nodes and can
> itself have a type and properties. Think of a GraphDB as an RDBMS
> where instead of tables you have a graph. It's the Semantic Web's
> triple stores brought to general purpose.
> Why would you use a GraphDB? Well, as you can describe your data in
> terms of entities and relationships, you're able to avoid defining a
> schema as we know it. It's a smaller step towards the schema-less
> representation of data that actual NoSQL systems provide.
> Informally, you can describe your data with ER diagrams without
> translating them into tables, keeping the representation dynamic and
> avoiding the costs of schema redefinition in RDBMSs. Plus, it's very
> efficient and easy to write queries that allow you to get, for
> example, all the followers of a user, all the users he follows, the
> items related to him (tweets he wrote?) and maybe the users connected
> to his items (users retweeting or replying to his tweets?) in one go.
> That's basically what Twitter wants to achieve with FlockDB, but with
> a general GraphDB you can describe your data and the relationships
> between your data as you wish.
> You store data in a GraphDB and you recall it in an easy and efficient
> way. So what's Pregel good for? What if you want to mine the data in
> the graph (i.e. Google's PageRank, Facebook's social network analysis,
> Twitter's retweeting/authority analysis)?
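The property-graph data model described in the blog post (nodes with an id, a type and properties; arcs with a type and properties) can be sketched in a few lines. This is a purely illustrative toy, not the API of any real GraphDB; all class and method names here are made up.

```python
# Toy property graph: nodes carry an id, a type and properties;
# arcs carry a type and properties. Illustrative only.
class Node:
    def __init__(self, node_id, node_type, **props):
        self.id, self.type, self.props = node_id, node_type, props

class Graph:
    def __init__(self):
        self.nodes = {}   # id -> Node
        self.arcs = []    # (src_id, arc_type, dst_id, props)

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = Node(node_id, node_type, **props)

    def add_arc(self, src, arc_type, dst, **props):
        self.arcs.append((src, arc_type, dst, props))

    def neighbors(self, node_id, arc_type, direction="out"):
        # Follow typed arcs out of (or into) a node in one pass.
        if direction == "out":
            return [d for s, t, d, _ in self.arcs
                    if s == node_id and t == arc_type]
        return [s for s, t, d, _ in self.arcs
                if d == node_id and t == arc_type]

# "All the followers of a user ... and the items related to him":
g = Graph()
g.add_node("alice", "user")
g.add_node("bob", "user")
g.add_node("t1", "tweet", text="hello")
g.add_arc("bob", "follows", "alice")
g.add_arc("alice", "wrote", "t1")

followers = g.neighbors("alice", "follows", direction="in")
tweets = g.neighbors("alice", "wrote")
```

The point is that the queries are plain traversals over the relationship structure, with no schema or join to define first.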
> Google reports that 80% of their distributed computation is based on
> MapReduce (Google Maps, search indexing, clustering in Google News,
> reports for Google Trends, Google Translate etc.), so we can only
> guess that the remaining 20% is based on Pregel, and the authors
> report they can work with graphs of the size of billions of vertices.
> Plus, implementing PageRank takes just about 15 lines of code...
> That's what Pregel is for. So what is it? Pregel is a system for
> large-scale graph processing. It provides a fault-tolerant framework
> for the execution of graph algorithms in parallel over many machines.
> Think of it as MapReduce re-thought for graph operations.
> But what's wrong with MapReduce and graph algorithms? Nothing in
> particular, though it can lead to suboptimal performance, because the
> graph state has to be passed from one phase to the other, generating
> a lot of I/O. In general we can say it has some usability issues, as
> it doesn't provide a way to do any per-vertex calculation; it's not
> easy to express graph algorithms in M/R. Pregel fills a gap, as there
> are no frameworks for graph processing that address both
> distributability and fault-tolerance.
> Pregel's architecture is inspired by the Bulk Synchronous Parallel
> model introduced by Valiant. BSP is a computational model for the
> execution of parallel algorithms on top of multiple sequential Von
> Neumann machines. It gives an abstraction, just like M/R, that allows
> the programmer to think about the parallel expression of his solution
> without the hassle of communication and memory allocation in a
> distributed system. Before we get into details, I think two things
> have to be underlined.
> First, again like M/R, although the model is used by Google to
> distribute computation among multiple computers, that's not
> necessary: in principle BSP also fits parallel programming on SMP or
> NUMA machines and mainframes.
> Second, although the model is used by Google to distribute graph
> processing, BSP can be used to distribute other kinds of algorithms,
> like matrix manipulation, just like M/R.
> Ok, how does BSP work? I'll take the diagram and snippet from the BSP
> page on Wikipedia:
> "A BSP computer consists of processors connected by a communication
> network. Each processor has a fast local memory, and may follow
> different threads of computation.
> A BSP computation proceeds in a series of global supersteps. A
> superstep consists of three ordered stages:
> 1. Concurrent computation: Several computations take place on every
> participating processor. Each process only uses values stored in the
> local memory of the processor. The computations are independent in
> the sense that they occur asynchronously of all the others.
> 2. Communication: At this stage, the processes exchange data between
> themselves.
> 3. Barrier synchronisation: When a process reaches this point (the
> barrier), it waits until all other processes have finished their
> communication actions.
> The figure below shows this in diagrammatic form. The processes are
> not regarded as having a particular linear order (from left to right
> or otherwise), and may be mapped to processors in any way."
> [bsp architecture]
> Ok, basically at every superstep every processor executes the same
> algorithm on its data: its state and the incoming messages. At
> superstep t every processor will work on its state, which is the
> result of its computation at superstep t-1, and the messages sent to
> it at superstep t-1. As a result of the superstep-t computation the
> processor will send messages to other processors, and these messages
> will be the incoming messages at superstep t+1. And the cycle goes
> on. The barrier synchronisation is the moment where t becomes t+1.
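The superstep cycle just described (local computation on state plus last superstep's messages, then a barrier where t becomes t+1) can be simulated in a few lines of Python. This is a toy, sequential stand-in for a real BSP machine; `bsp_run` and `shift` are illustrative names, not part of any library.

```python
# Toy BSP runner: at superstep t each processor computes on its local
# state plus the messages sent to it at superstep t-1; the "barrier"
# is the point where all outgoing messages become the next superstep's
# incoming messages at once.
def bsp_run(states, compute, supersteps):
    """states: dict proc_id -> local state.
    compute(proc_id, state, inbox) -> (new_state, {dst: [msgs]})."""
    inboxes = {p: [] for p in states}
    for t in range(supersteps):
        outboxes = {}
        # 1. concurrent computation (done sequentially here, for clarity)
        for p in states:
            states[p], outboxes[p] = compute(p, states[p], inboxes[p])
        # 2.+3. communication and barrier: messages sent at superstep t
        # are only visible at superstep t+1
        inboxes = {p: [] for p in states}
        for p in states:
            for dst, msgs in outboxes[p].items():
                inboxes[dst].extend(msgs)
    return states

# Example computation: processor p adds up what it received and sends
# its own id one hop around a ring of 3 processors each superstep.
def shift(p, state, inbox):
    return state + sum(inbox), {(p + 1) % 3: [p]}
```

Running `bsp_run({0: 0, 1: 0, 2: 0}, shift, 2)` shows the one-superstep delay: ids sent at superstep 0 only reach their neighbor's state at superstep 1.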
> It is easy to see that each computation should take approximately the
> same amount of time, otherwise a long-lasting computation will force
> the others to wait idle.
> How does Pregel implement BSP? Quoting Pregel's original paper: "The
> input to a Pregel computation is a directed graph in which each
> vertex is uniquely identified by a string vertex identifier. Each
> vertex is associated with a modifiable, user-defined value. The
> directed edges are associated with their source vertices, and each
> edge consists of a modifiable, user-defined value and a target vertex
> identifier. A typical Pregel computation consists of input, when the
> graph is initialized, followed by a sequence of supersteps separated
> by global synchronization points until the algorithm terminates, and
> finishing with output.
> Within each superstep the vertices compute in parallel, each
> executing the same user-defined function that expresses the logic of
> a given algorithm. A vertex can modify its state or that of its
> outgoing edges, receive messages sent to it in the previous
> superstep, send messages to other vertices (to be received in the
> next superstep), or even mutate the topology of the graph. Edges are
> not first-class citizens in this model, having no associated
> computation.
> Algorithm termination is based on every vertex voting to halt. In
> superstep 0, every vertex is in the active state; all active vertices
> participate in the computation of any given superstep. A vertex
> deactivates itself by voting to halt. This means that the vertex has
> no further work to do unless triggered externally, and the Pregel
> framework will not execute that vertex in subsequent supersteps
> unless it receives a message. If reactivated by a message, a vertex
> must explicitly deactivate itself again. The algorithm as a whole
> terminates when all vertices are simultaneously inactive and there
> are no messages in transit."
> The mapping between BSP and Pregel is very simple: each local
> computation of BSP maps to the user-defined function in Pregel, and
> the communication often, but not necessarily, corresponds to edge
> connectivity between nodes. The barrier is defined by the halt voting
> of all the active nodes.
> From the perspective of the API, Pregel requires the implementation
> of the virtual Compute() method of the class Vertex. The class Vertex
> itself provides VoteToHalt(), SendMessageTo(), GetValue(),
> GetOutEdgeIterator() and the const methods superstep() and
> vertex_id().
> Like M/R, it provides the possibility to define Combiners in order to
> reduce message-passing overhead by combining messages together where
> semantically possible. Like Sawzall, Pregel provides Aggregators,
> which allow global communication by receiving messages from multiple
> vertices, combining them and sending the result back to the vertices.
> They are useful for statistics (think of a histogram of vertex
> degrees) or for global control (for example, an aggregator can
> collect all the vertices' PageRank deltas to calculate the
> convergence condition).
> From the perspective of architecture, Pregel follows a master/worker
> design, like most of the other Google frameworks. The master is
> responsible for partitioning the graph with a hash function based on
> the vertex ID (like hash(ID) mod #partitions, although a
> topology-aware partitioner might be able to minimize communication
> between workers by keeping messages intra-machine) but doesn't
> compute any partition itself. At the beginning of the computation the
> workers register with the master.
> Once the graph is partitioned and the partitions are assigned to
> workers, the master issues the start of the superstep. Each worker
> loops through all its active vertices, calling the Compute() method
> and delivering the messages collected in the previous superstep. The
> new messages are delivered before the end of the superstep, right
> before telling the master the list of active vertices for the next
> superstep.
> After the computation halts, the master might ask the workers to dump
> their graph partition to disk.
> At the moment there are no other projects that handle such
> computational power over graphs. Parallel BGL and CGMLib can handle
> parallel processing on graphs but don't scale to this size and are
> not fault-tolerant. Hama, an Apache-incubated project, aims at
> developing a model similar to Pregel, but it's not complete yet.
> So, where do I go now?
> Google's Research Blog announcement
> BSP World-wide
> NoSQL GraphDB

--
Best Regards, Edward J. Yoon
[email protected]
http://blog.udanax.org
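To tie the thread together, here is a Pregel-style PageRank sketch, mirroring in Python the C++ Vertex API named in the blog post (Compute(), VoteToHalt(), SendMessageTo()) and its termination rule (all vertices inactive and no messages in transit). The single-machine runner below stands in for the master/worker machinery; the names, the fixed 30-iteration cutoff, and the assumption that every vertex has at least one outgoing edge are all illustrative, not Google's implementation.

```python
# Pregel-style PageRank sketch: each vertex runs the same compute()
# per superstep, reads last superstep's messages, sends new ones, and
# votes to halt when done.
class Vertex:
    def __init__(self, vid, out_edges, num_vertices):
        self.id, self.out_edges, self.n = vid, out_edges, num_vertices
        self.value = 1.0 / num_vertices
        self.active = True
        self.outbox = []

    def vote_to_halt(self):
        self.active = False

    def send_message_to(self, dst, msg):
        self.outbox.append((dst, msg))

    def compute(self, superstep, messages):
        if superstep > 0:
            self.value = 0.15 / self.n + 0.85 * sum(messages)
        if superstep < 30:  # fixed iteration count, as in the paper's example
            share = self.value / len(self.out_edges)
            for dst in self.out_edges:
                self.send_message_to(dst, share)
        else:
            self.vote_to_halt()

def run(graph):
    """graph: dict vertex_id -> list of out-neighbor ids (all non-empty)."""
    n = len(graph)
    vertices = {v: Vertex(v, outs, n) for v, outs in graph.items()}
    inboxes = {v: [] for v in graph}
    superstep = 0
    # Terminate when all vertices are inactive and no messages are in transit.
    while any(v.active for v in vertices.values()) or any(inboxes.values()):
        for v in vertices.values():
            if v.active or inboxes[v.id]:
                v.active = True  # a message reactivates a halted vertex
                v.compute(superstep, inboxes[v.id])
        # barrier: deliver this superstep's messages for the next one
        inboxes = {v: [] for v in graph}
        for v in vertices.values():
            for dst, msg in v.outbox:
                inboxes[dst].append(msg)
            v.outbox = []
        superstep += 1
    return {v.id: v.value for v in vertices.values()}
```

On a 3-vertex cycle (`{"a": ["b"], "b": ["c"], "c": ["a"]}`) the ranks converge to 1/3 each and sum to 1, and the whole vertex program is indeed in the neighborhood of the "about 15 lines" the blog post mentions.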
