Re: GSoC 2013

Gianmarco De Francisci Morales Tue, 09 Apr 2013 00:11:14 -0700

+1 to what Dmitriy says.

Cheers,


--
Gianmarco


On Mon, Apr 8, 2013 at 8:57 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Hi,
> I think this is an interesting project but is not core to "Pig" itself --
> it may be more interesting / viable as a standalone project on github that
> uses Pig to implement graph algorithms.
> At this point in its development, I feel that Pig needs to concentrate on
> doing the things it already does, and do them better (operator efficiency,
> storage efficiency, better MR plan generation, etc) rather than expand to
> specific verticals; we should allow our users to create their own solution
> suites that use Pig for specific purposes. A successful example of such a
> standalone project is PacketPig (https://github.com/packetloop/packetpig),
> a PCAP network capture analysis tool.
>
> D
>
>
> On Tue, Apr 2, 2013 at 9:48 AM, burakkk <[email protected]> wrote:
>
> > I know that, but Giraph uses BSP. What I'm describing is a shared-nothing
> > model, except for the reducers. Besides, I don't want to split the
> > iteration: one phase is still responsible for the whole iteration. Each
> > origin vertex will be processed in parallel.
> >
> > Thanks
> > Best regards...
> >
> >
> > On Tue, Apr 2, 2013 at 7:20 PM, Gianmarco De Francisci Morales <[email protected]> wrote:
> >
> > > FYI, Giraph has a Random Walk implementation.
> > >
> > > Pig does not support iteration natively, so any iterative algorithm is
> > > not a very good fit for it. Just my 2c.
> > >
> > > Cheers,
> > >
> > > --
> > > Gianmarco
> > >
> > >
> > > On Tue, Apr 2, 2013 at 10:04 AM, burakkk <[email protected]> wrote:
> > >
> > > > So what do you suggest? Is it clear?
> > > >
> > > >
> > > > On Mon, Apr 1, 2013 at 9:35 PM, burakkk <[email protected]> wrote:
> > > >
> > > > > I'm using only the WTF graph representation so that the graph fits in
> > > > > memory. By the way, I haven't seen any explanation of WTF or graph
> > > > > models on the Pig 0.11 release page.
> > > > > I don't want to use Cassovary. I believe it can be done with Pig. I will
> > > > > implement the graph representation from the WTF paper in Pig and then
> > > > > use it to implement a random walk algorithm. To do that I may need to
> > > > > improve some features such as joins (fuzzy join), or implement a new
> > > > > operator. I can implement it using either existing operators or new
> > > > > operators; that's up to us and it doesn't really matter. If there is
> > > > > already an implementation of a random walk algorithm, please tell me,
> > > > > because I haven't found one.
> > > > > Are you proposing to create an open-source implementation of those
> > > > > algorithms?
> > > > > Yes, I'm proposing to implement a random walk algorithm and a new data
> > > > > model that represents a graph. After that, people can use it when
> > > > > writing Pig scripts.
> > > > >
> > > > > Do you suggest they should be Pig scripts added to the Pig project, or
> > > > > do you want to create some new operators?
> > > > > It could be either a UDF or a new operator.
> > > > >
> > > > > I made a quick example. It may not be completely accurate; I've just
> > > > > tried to explain the idea.
> > > > > Suppose you have a graph file like this:
> > > > > user_id follower
> > > > > 1 2
> > > > > 1 3
> > > > > 1 10
> > > > > 2 3
> > > > > 3 4
> > > > > 3 5
> > > > > ...
> > > > >
> > > > > The vertex list is an array of sorted vertex ids.
> > > > > The node list is a matrix of vertex ids and their starting positions.
> > > > >
> > > > >
> > > > > -- load the graph file
> > > > > graph = LOAD 'graph' USING PigStorage() AS (vertex:int, follower:int);
> > > > > vertex = COGROUP graph BY (vertex);
> > > > > -- load all vertices from HDFS into memory
> > > > > list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) AS vertexList;
> > > > > -- load all vertices from HDFS into memory
> > > > > list = FOREACH graph GENERATE org.apache.pig.generateNode(list) AS nodeList;
> > > > > -- generate a score: using the node list you can traverse the graph to
> > > > > -- your finishing position
> > > > > randomWalk = FOREACH vertex GENERATE flatten(org.apache.pig.RandomWalk(list, endVertex)) AS score;
> > > > > store...
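
One possible reading of the vertex list / node list description above is a compact adjacency layout: a flat array of follower ids grouped by source vertex, plus an index from each vertex id to its starting position in that array. The Python below is a minimal sketch under that assumption; it is not the WTF or Cassovary code, and it is not a Pig UDF. The names `build_index`, `neighbors`, and `random_walk_visits`, the restart probability, and the fixed seed are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def build_index(edges):
    """Pack (vertex, follower) pairs into a flat array plus an offset index.

    Returns (offsets, targets): targets is the flat follower array and
    offsets maps each source vertex id to its (start, end) slice in targets.
    This layout is an assumption about the description in the thread.
    """
    targets = []
    spans = {}
    for vertex, follower in sorted(edges):
        if vertex not in spans:
            spans[vertex] = [len(targets), len(targets)]
        targets.append(follower)
        spans[vertex][1] = len(targets)
    return {v: tuple(span) for v, span in spans.items()}, targets


def neighbors(offsets, targets, vertex):
    """Return the followers of a vertex, or an empty list if it has none."""
    start, end = offsets.get(vertex, (0, 0))
    return targets[start:end]


def random_walk_visits(offsets, targets, origin, steps, restart=0.15, seed=0):
    """Count visits of a random walk with restart that starts at origin."""
    rng = random.Random(seed)
    visits = Counter()
    current = origin
    for _ in range(steps):
        out = neighbors(offsets, targets, current)
        if not out or rng.random() < restart:
            current = origin  # dead end or restart: jump back to the origin
        else:
            current = rng.choice(out)
            visits[current] += 1
    return visits


# Sample edges taken from the graph file in the message above.
edges = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]
offsets, targets = build_index(edges)
scores = random_walk_visits(offsets, targets, origin=1, steps=1000)
```

With the sample edges, `targets` is `[2, 3, 10, 3, 4, 5]` and vertex 3 maps to the slice `(4, 6)`, so the whole graph lives in one array and one small index.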
> > > > >
> > > > >
> > > > > Thanks
> > > > > Best Regards...
> > > > >
> > > > >
> > > > > On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > >
> > > > >> I'm somewhat familiar with WTF code (my day job is managing the
> > > > >> analytics infrastructure team at Twitter). WTF is implemented using
> > > > >> Pig 0.11 (in fact some of the Pig 11 features/improvements are
> > > > >> directly due to this project...), and mostly has to do with clever
> > > > >> algorithms implemented in Pig (an earlier version of WTF loaded the
> > > > >> graph into main memory on large-mem machines -- that system is open
> > > > >> sourced, too, under github.com/twitter/cassovary). Are you proposing
> > > > >> to create an open-source implementation of those algorithms? Do you
> > > > >> suggest they should be Pig scripts added to the Pig project, or do
> > > > >> you want to create some new operators? I'm not totally sure where you
> > > > >> are going here.
> > > > >>
> > > > >> GSoC proposals for Pig are usually made by students who want to work
> > > > >> on issues labeled as GSoC candidates on the apache jira. The students
> > > > >> spend some time to understand the problem stated in the jira,
> > > > >> familiarize themselves with the existing codebase, and put a basic
> > > > >> technical implementation plan and schedule into their proposal. Since
> > > > >> in this case you are proposing something we haven't scoped or defined
> > > > >> well for ourselves, we need you to be very clear and specific about
> > > > >> what you are trying to do, and how you plan to go about it. I think
> > > > >> that Graph processing in Pig (or other Hadoop-based systems) is a
> > > > >> really interesting topic and there is a lot of work to be done, but
> > > > >> we really need you to be far more detailed to be able to give you
> > > > >> good guidance with regards to GSoC.
> > > > >>
> > > > >> Best,
> > > > >> Dmitriy
> > > > >>
> > > > >>
> > > > >> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <[email protected]> wrote:
> > > > >>
> > > > >> > Sure. We can implement a graph model using the "WTF: The Who to
> > > > >> > Follow Service at Twitter" article. The article says that this way
> > > > >> > the graph can be stored in one machine's memory, so every node will
> > > > >> > read the graph from HDFS and cache it in memory. Every node is
> > > > >> > responsible for processing its own bucket of edges; I mean the work
> > > > >> > can be split. Every node can process its bucket using a random walk
> > > > >> > algorithm, for instance. Finally, the outputs can be reduced to get
> > > > >> > the final results. I hope it's clear :)
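
The split-then-reduce flow described in this message can be sketched in plain Python. This is only an illustration under stated assumptions, not code from the WTF paper or from Pig: the buckets here hold origin vertices rather than edges, they are processed sequentially (in a real job each bucket would go to a separate task), and `make_buckets`, `walk_bucket`, and `merge` are hypothetical names.

```python
import random
from collections import Counter
from functools import reduce

# Toy follower graph (vertex -> followers), held fully in memory by every worker.
GRAPH = {1: [2, 3, 10], 2: [3], 3: [4, 5]}


def make_buckets(origins, num_buckets):
    """Assign origin vertices to buckets round-robin (an assumed scheme)."""
    buckets = [[] for _ in range(num_buckets)]
    for i, origin in enumerate(sorted(origins)):
        buckets[i % num_buckets].append(origin)
    return buckets


def walk_bucket(graph, bucket, steps, seed):
    """Map side: run one random walk per origin in the bucket and count visits."""
    rng = random.Random(seed)
    visits = Counter()
    for origin in bucket:
        current = origin
        for _ in range(steps):
            out = graph.get(current, [])
            if not out:
                current = origin  # dead end: restart from the origin
                continue
            current = rng.choice(out)
            visits[(origin, current)] += 1
    return visits


def merge(left, right):
    """Reduce side: add up the partial visit counts."""
    return left + right


buckets = make_buckets(GRAPH.keys(), num_buckets=2)
partials = [walk_bucket(GRAPH, b, steps=100, seed=i) for i, b in enumerate(buckets)]
totals = reduce(merge, partials, Counter())
```

Because each bucket only reads the shared graph and writes its own counter, the buckets need no coordination until the final merge, which is the property the message relies on.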
> > > > >> >
> > > > >> > Thanks
> > > > >> > Best Regards...
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > >> >
> > > > >> > > Hi Burakk,
> > > > >> > > The general idea of making graph processing easier is a good one.
> > > > >> > > I'm not sure what exactly you are proposing to do, though. Could
> > > > >> > > you be more detailed about what you are thinking?
> > > > >> > >
> > > > >> > >
> > > > >> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <[email protected]> wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > > I might be a little bit late. I came up with a new idea at the
> > > > >> > > > last minute. Currently I'm working on social graph processing.
> > > > >> > > > I think we can implement a solution for Pig. With this idea I'm
> > > > >> > > > thinking of applying to GSoC 2013 so that I can work on some
> > > > >> > > > tasks related to it. Is there a mentor who would do it with me?
> > > > >> > > > Are there any suggestions? :)
> > > > >> > > >
> > > > >> > > > Details:
> > > > >> > > > Of course I can improve some join operations. I'm not sure
> > > > >> > > > whether there is any implementation of fuzzy joins, for
> > > > >> > > > instance. These are the papers that I found:
> > > > >> > > >
> > > > >> > > > Fuzzy Joins Using MapReduce
> > > > >> > > > http://ilpubs.stanford.edu:8090/1006/
> > > > >> > > >
> > > > >> > > > Dimension independent similarity computation
> > > > >> > > > http://arxiv.org/abs/1206.2082
> > > > >> > > >
> > > > >> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw
> > > > >> > > > Away Everything That’s Not a Nail!
> > > > >> > > > http://arxiv.org/pdf/1209.2191.pdf
> > > > >> > > >
> > > > >> > > > Large Graph Processing in the Cloud
> > > > >> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > > > >> > > >
> > > > >> > > > ..etc
> > > > >> > > >
> > > > >> > > > Thanks
> > > > >> > > > Best regards..
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > >
> > > > >> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > >> > > > *
> > > > >> > > > *
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> >
> > > > >> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > >> > *
> > > > >> > *
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > > *
> > > > > *
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > *
> > > > *
> > > >
> > >
> >
> >
> >
> > --
> >
> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > *
> > *
> >
>
