+1 to what Dmitriy says. Cheers,
--
Gianmarco

On Mon, Apr 8, 2013 at 8:57 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Hi,
> I think this is an interesting project but is not core to "Pig" itself --
> it may be more interesting / viable as a standalone project on github that
> uses Pig to implement graph algorithms.
> At this point in its development, I feel that Pig needs to concentrate on
> doing the things it already does, and do them better (operator efficiency,
> storage efficiency, better MR plan generation, etc) rather than expand to
> specific verticals; we should allow our users to create their own solution
> suites that use Pig for specific purposes. A successful example of such a
> standalone project is PacketPig (https://github.com/packetloop/packetpig),
> a PCAP network capture analysis tool.
>
> D
>
>
> On Tue, Apr 2, 2013 at 9:48 AM, burakkk <burak.isi...@gmail.com> wrote:
>
> > I know that, but Giraph uses BSP. What I'm proposing is a shared-nothing
> > model, except for the reducers. Besides, I don't want to split up the
> > iteration: one phase is still responsible for the whole iteration, and
> > each origin vertex is processed in parallel.
> >
> > Thanks
> > Best regards...
> >
> >
> > On Tue, Apr 2, 2013 at 7:20 PM, Gianmarco De Francisci Morales
> > <g...@gdfm.me> wrote:
> >
> > > FYI, Giraph has a Random Walk implementation.
> > >
> > > Pig does not support iteration natively, so any iterative algorithm is
> > > not a very good fit for it. Just my 2c.
> > >
> > > Cheers,
> > >
> > > --
> > > Gianmarco
> > >
> > >
> > > On Tue, Apr 2, 2013 at 10:04 AM, burakkk <burak.isi...@gmail.com> wrote:
> > >
> > > > So what do you suggest? Is it clear?
> > > >
> > > >
> > > > On Mon, Apr 1, 2013 at 9:35 PM, burakkk <burak.isi...@gmail.com> wrote:
> > > >
> > > > > I'm using only the WTF graph representation so that the graph fits
> > > > > in memory. By the way, I haven't seen any explanation of WTF or
> > > > > graph models on the Pig 0.11 release page.
> > > > > I don't want to use Cassovary. I believe it can be done with Pig.
> > > > > I'll implement the graph representation from the WTF paper in Pig
> > > > > and then use it to implement a random walk algorithm. To do that I
> > > > > may need to improve some features such as joins (fuzzy join), or
> > > > > implement a new operator. I can implement it using either existing
> > > > > operators or new operators. That's up to us and it doesn't really
> > > > > matter. If there is already an implementation of a random walk
> > > > > algorithm, please feel free to tell me, because I haven't found one.
> > > > >
> > > > > Are you proposing to create an open-source implementation of those
> > > > > algorithms?
> > > > > Yes, I'm proposing to implement a random walk algorithm and a new
> > > > > data model that represents a graph. After that, people can use it
> > > > > when coding in Pig.
> > > > >
> > > > > Do you suggest they should be Pig scripts added to the Pig project,
> > > > > or do you want to create some new operators?
> > > > > Maybe. It can be a UDF or a new operator.
> > > > >
> > > > > I made a quick example. It may not be completely accurate; I've just
> > > > > tried to explain the idea.
> > > > > Suppose you have a graph file like this:
> > > > > user_id follower
> > > > > 1 2
> > > > > 1 3
> > > > > 1 10
> > > > > 2 3
> > > > > 3 4
> > > > > 3 5
> > > > > ...
> > > > >
> > > > > Vertex List is an array of sorted vertex ids
> > > > > Node List is a matrix of vertex ids and their starting positions
> > > > >
> > > > >
> > > > > graph = load 'graph' using PigStorage() AS (vertex:int, follower:int); -- load the graph file
> > > > > vertex = COGROUP graph BY (vertex);
> > > > > list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) as vertexList; -- load all the vertices from HDFS into memory
> > > > > list = FOREACH graph GENERATE org.apache.pig.generateNode(list) as nodeList; -- load all the vertices from HDFS into memory
> > > > > randomWalk = FOREACH vertex GENERATE flatten(org.apache.pig.RandomWalk(list, endVertex)) as score; -- generate a score; using the node list you can traverse the graph to your finishing position
> > > > > store...
> > > > >
> > > > >
> > > > > Thanks
> > > > > Best Regards...
> > > > >
> > > > >
> > > > > On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > > >
> > > > >> I'm somewhat familiar with WTF code (my day job is managing the
> > > > >> analytics infrastructure team at Twitter). WTF is implemented using
> > > > >> Pig 0.11 (in fact some of the Pig 11 features/improvements are
> > > > >> directly due to this project...), and mostly has to do with clever
> > > > >> algorithms implemented in Pig (an earlier version of WTF loaded the
> > > > >> graph into main memory on large-mem machines -- that system is open
> > > > >> sourced, too, under github.com/twitter/cassovary). Are you proposing
> > > > >> to create an open-source implementation of those algorithms? Do you
> > > > >> suggest they should be Pig scripts added to the Pig project, or do
> > > > >> you want to create some new operators? I'm not totally sure where
> > > > >> you are going here.
> > > > >>
> > > > >> GSoC proposals for Pig are usually made by students who want to work
> > > > >> on issues labeled as GSoC candidates on the apache jira. The students
> > > > >> spend some time to understand the problem stated in the jira,
> > > > >> familiarize themselves with the existing codebase, and put a basic
> > > > >> technical implementation plan and schedule into their proposal. Since
> > > > >> in this case you are proposing something we haven't scoped or defined
> > > > >> well for ourselves, we need you to be very clear and specific about
> > > > >> what you are trying to do, and how you plan to go about it. I think
> > > > >> that Graph processing in Pig (or other Hadoop-based systems) is a
> > > > >> really interesting topic and there is a lot of work to be done, but
> > > > >> we really need you to be far more detailed to be able to give you
> > > > >> good guidance with regards to GSoC.
> > > > >>
> > > > >> Best,
> > > > >> Dmitriy
> > > > >>
> > > > >>
> > > > >> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <burak.isi...@gmail.com> wrote:
> > > > >>
> > > > >> > Sure. We can implement a graph model based on the "WTF: The Who to
> > > > >> > Follow Service at Twitter" article. The article says that this way
> > > > >> > the graph can be stored in one machine's memory, so every node will
> > > > >> > read the graph from HDFS and cache it in memory. Every node is
> > > > >> > responsible for processing its own bucket of edges; I mean the work
> > > > >> > can be split. Every node can process its bucket using a random walk
> > > > >> > algorithm, for instance. Finally, the partial results can be reduced
> > > > >> > to get the final results. I hope it's clear :)
> > > > >> >
> > > > >> > Thanks
> > > > >> > Best Regards...
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi Burakk,
> > > > >> > > The general idea of making graph processing easier is a good one.
> > > > >> > > I'm not sure what exactly you are proposing to do, though. Could
> > > > >> > > you be more detailed about what you are thinking?
> > > > >> > >
> > > > >> > >
> > > > >> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <burak.isi...@gmail.com> wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > > I might be a little bit late; I came up with a new idea at the
> > > > >> > > > last minute. Currently I'm working on social graph processing.
> > > > >> > > > I think we can implement a solution for Pig. With this idea I'm
> > > > >> > > > thinking of applying to GSoC 2013 so that I can work on some
> > > > >> > > > tasks around it. Is there a mentor who could do it with me? Are
> > > > >> > > > there any suggestions? :)
> > > > >> > > >
> > > > >> > > > Details:
> > > > >> > > > Of course I can improve some join operations. I'm not sure
> > > > >> > > > whether there is any implementation of fuzzy joins, for
> > > > >> > > > instance. These are the papers that I found:
> > > > >> > > >
> > > > >> > > > Fuzzy Joins Using MapReduce
> > > > >> > > > http://ilpubs.stanford.edu:8090/1006/
> > > > >> > > >
> > > > >> > > > Dimension independent similarity computation
> > > > >> > > > http://arxiv.org/abs/1206.2082
> > > > >> > > >
> > > > >> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw
> > > > >> > > > Away Everything That’s Not a Nail!
> > > > >> > > > http://arxiv.org/pdf/1209.2191.pdf
> > > > >> > > >
> > > > >> > > > Large Graph Processing in the Cloud
> > > > >> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > > > >> > > >
> > > > >> > > > ..etc
> > > > >> > > >
> > > > >> > > > Thanks
> > > > >> > > > Best regards..
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > BURAK ISIKLI | http://burakisikli.wordpress.com
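
For readers of this thread: the "Vertex List" / "Node List" layout described above (a sorted array of vertex ids plus each vertex's starting position) resembles a compressed adjacency-array layout. Below is a minimal Python sketch of that idea with a simple random walk on top. It is not code from the thread, from WTF, or from Pig. The names build_graph, neighbors and random_walk_scores, the restart probability, and the choice to walk from a user to its followers are all assumptions made for illustration, and only the sample rows actually shown in the thread are used.

```python
import random
from bisect import bisect_left
from collections import Counter

# Sample (user_id, follower) rows shown in the thread; elided rows are not guessed.
EDGES = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]


def build_graph(edges):
    """Pack edges into three flat arrays (hypothetical helper).

    vertex_list: sorted ids of vertices that have at least one follower
    offsets:     offsets[i] is where vertex_list[i]'s followers start
    followers:   all follower ids, grouped by source vertex
    """
    edges = sorted(edges)
    vertex_list = sorted({v for v, _ in edges})
    offsets, followers = [], []
    i = 0
    for v in vertex_list:
        offsets.append(len(followers))
        while i < len(edges) and edges[i][0] == v:
            followers.append(edges[i][1])
            i += 1
    offsets.append(len(followers))  # sentinel marking the end of the last group
    return vertex_list, offsets, followers


def neighbors(graph, v):
    """Return the followers of v, or an empty list if v has none."""
    vertex_list, offsets, followers = graph
    i = bisect_left(vertex_list, v)  # binary search in the sorted id array
    if i == len(vertex_list) or vertex_list[i] != v:
        return []
    return followers[offsets[i]:offsets[i + 1]]


def random_walk_scores(graph, origin, steps, restart=0.15, seed=0):
    """Count how often each vertex is reached by following an edge.

    The walk jumps back to origin at dead ends or with probability
    `restart` (an assumed parameter, not taken from the thread).
    """
    rng = random.Random(seed)
    visits = Counter()
    current = origin
    for _ in range(steps):
        out = neighbors(graph, current)
        if not out or rng.random() < restart:
            current = origin
        else:
            current = rng.choice(out)
            visits[current] += 1
    return visits


graph = build_graph(EDGES)
scores = random_walk_scores(graph, origin=1, steps=1000)
```

Because each origin vertex gets its own independent walk over read-only arrays, walks for different origins could in principle run in parallel, which is the property the proposal relies on. How this would map onto Pig operators or UDFs is left open here, as it was in the thread.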