I see this as a multi-part question. Looking back at some of the significant roadmap/existential questions asked in the last 12 months, I see the following:
1. With the introduction of SQL, what is the philosophy of Pig? (I sent an email about this approximately 9 months ago.)
2. What is the approach to supporting backward compatibility in Pig? (Alan sent an email about this 3 months ago.)
3. Should Pig be a TLP? (The current email thread.)

Here is my take on answering the aforementioned questions.

The initial philosophy of Pig was to be backend agnostic. It was designed as a data flow language. Whenever a new language is designed, the syntax and semantics of the language have to be laid out. The syntax is usually captured in the form of a BNF grammar. The semantics are defined by the language creators. Backward compatibility is then a question of holding true to the syntax and semantics. With Pig, in addition to the language, the Java APIs were exposed to customers to implement UDFs (load/store/filter/grouping/row transformation etc.), to provide looping, since the language itself has no looping constructs, and to support a programmatic mode of access. Backward compatibility in this context means supporting API versioning.

Do we still intend to position Pig as a data flow language that is backend agnostic? If the answer is yes, then there is a strong case for making Pig a TLP.

Are we influenced by Hadoop? A big YES! The reason Pig chose to become a Hadoop sub-project was to ride the Hadoop popularity wave. As a consequence, we chose to be heavily influenced by the Hadoop roadmap.

Like a good lawyer, I also have rebuttals to Alan's questions :)

1. Search engine popularity - We can discuss this with the Hadoop team and still retain links to TLPs that are coupled (loosely or tightly).
2. Explicit connection to Hadoop - I see this as a logical connection vs. a physical connection. Today, we are physically connected as a sub-project. Becoming a TLP will not increase or decrease our influence on the Hadoop community (think Logical, Physical and MR Layers :)
3. Philosophy - I have already talked about this. The tight coupling is by choice.
If Pig continues to be a data flow language with clear syntax and semantics, then someone can implement Pig on top of a different backend. Do we intend to take this approach?

I just wanted to offer a different opinion to this thread. I strongly believe that we should think about the original philosophy. Will we have a Pig standards committee that will decide on changes to the language (think C/C++) if there are multiple backend implementations?

I will reserve my vote pending the outcome of the philosophy and backward compatibility discussions. If we decide that Pig will be treated and maintained like a true language with clear syntax and semantics, then we have a strong case to make it a TLP. If not, we should retain our existing ties to Hadoop and make Pig a data flow language for Hadoop.

Santhosh

-----Original Message-----
From: Thejas Nair [mailto:te...@yahoo-inc.com]
Sent: Friday, April 02, 2010 4:08 PM
To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy
Subject: Re: Begin a discussion about Pig as a top level project

I agree with Alan and Dmitriy - Pig is tightly coupled with Hadoop, and heavily influenced by its roadmap. I think it makes sense to continue as a sub-project of Hadoop.

-Thejas

On 3/31/10 4:04 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

> Over time, Pig is increasing its coupling to Hadoop (for good
> reasons), rather than decreasing it. If and when Pig becomes a viable
> entity without Hadoop around, it might make sense as a TLP. As is, I
> think becoming a TLP will only introduce unnecessary administrative
> and bureaucratic headaches.
> So my vote is also -1.
>
> -Dmitriy
>
> On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
>> So far I haven't seen any feedback on this. Apache has asked the
>> Hadoop PMC to submit input in April on whether some subprojects
>> should be promoted to TLPs. We, the Pig community, need to give
>> feedback to the Hadoop PMC on how we feel about this.
>> Please make your voice heard.
>>
>> So now I'll heed my own call and give my thoughts on it.
>>
>> The biggest advantage I see to being a TLP is a direct connection to
>> Apache. Right now all of the Pig team's interaction with Apache is
>> through the Hadoop PMC. Being directly connected to Apache would
>> benefit Pig team members, who would have a better view into Apache.
>> It would also raise our profile in Apache and thus make other
>> projects more aware of us.
>>
>> However, I am concerned about losing Pig's explicit connection to
>> Hadoop. This concern has a couple of dimensions. One, Hadoop and
>> MapReduce are the current flavor of the month in computing. Given
>> that Pig shares a name with the common farm animal, it's hard to be
>> sure based on search statistics. But Google Trends shows that
>> "hadoop" is searched on much more frequently than "hadoop pig" or
>> "apache pig" (see
>> http://www.google.com/trends?q=hadoop%2Chadoop+pig). I am guessing
>> that most Pig users are Hadoop users who discover Pig via Hadoop's
>> website. Losing that subproject tab on Hadoop's front page may
>> radically lower the number of users coming to Pig to check out our
>> project. I would argue that keeping that tab benefits Hadoop as
>> well, since high level languages like Pig Latin have the potential
>> to greatly extend the user base and usability of Hadoop.
>>
>> Two, being explicitly connected to Hadoop keeps our two communities
>> aware of each other's needs. There are features proposed for MR that
>> would greatly help Pig. By staying in the Hadoop community Pig is
>> better positioned to advocate for and help implement and test those
>> features. The response to this will be that Pig developers can still
>> subscribe to Hadoop mailing lists, submit patches, etc. That is,
>> they can still be part of the Hadoop community.
>> Which reinforces my point that it makes more sense to leave Pig in
>> the Hadoop community, since Pig developers will need to be part of
>> that community anyway.
>>
>> Finally, philosophically it makes sense to me that projects that are
>> tightly connected belong together. It strikes me as strange to have
>> Pig as a TLP completely dependent on another TLP. Hadoop was
>> originally a subproject of Lucene. It moved out to be a TLP when it
>> became obvious that Hadoop had become independent of and useful apart
>> from Lucene. Pig is not in that position relative to Hadoop.
>>
>> So, I'm -1 on Pig moving out. But this is a soft -1. I'm open to
>> being persuaded that I'm wrong or that my concerns can be addressed
>> while still having Pig as a TLP.
>>
>> Alan.
>>
>> On Mar 19, 2010, at 10:59 AM, Alan Gates wrote:
>>
>>> You have probably heard by now that there is a discussion going on
>>> in the Hadoop PMC as to whether a number of the subprojects (HBase,
>>> Avro, ZooKeeper, Hive, and Pig) should move out from under the
>>> Hadoop umbrella and become top level Apache projects (TLP). This
>>> discussion has picked up recently since the Apache board has clearly
>>> communicated to the Hadoop PMC that it is concerned that Hadoop is
>>> acting as an umbrella project with many disjoint subprojects
>>> underneath it. They are concerned that this gives Apache little
>>> insight into the health and happenings of the subproject communities,
>>> which in turn means Apache cannot properly mentor those communities.
>>>
>>> The purpose of this email is to start a discussion within the Pig
>>> community about this topic. Let me cover first what becoming a TLP
>>> would mean for Pig, and then I'll go into what options I think we as
>>> a community have.
>>>
>>> Becoming a TLP would mean that Pig would itself have a PMC that
>>> would report directly to the Apache board. Who would be on the PMC
>>> would be something we as a community would need to decide.
>>> Common options would be to say all active committers are on the
>>> PMC, or all active committers who have been a committer for at
>>> least a year. We would also need to elect a chair of the PMC. This
>>> lucky person would have no additional power, but would have the
>>> additional responsibility of writing quarterly reports on Pig's
>>> status for Apache board meetings, as well as coordinating with
>>> Apache to get accounts for new committers, etc. For more
>>> information see
>>> http://www.apache.org/foundation/how-it-works.html#roles
>>>
>>> Becoming a TLP would not mean that we are ostracized from the Hadoop
>>> community. We would continue to be invited to Hadoop Summits, HUGs,
>>> etc. Since all Pig developers and users are by definition Hadoop
>>> users, we would continue to be a strong presence in the Hadoop
>>> community.
>>>
>>> I see three ways that we as a community can respond to this:
>>>
>>> 1) Say yes, we want to be a TLP now.
>>> 2) Say yes, we want to be a TLP, but not yet. We feel we need more
>>> time to mature. If we choose this option we need to be able to
>>> clearly articulate how much time we need and what we hope to see
>>> change in that time.
>>> 3) Say no, we feel the benefits for us staying with Hadoop outweigh
>>> the drawbacks of being a disjoint subproject. If we choose this, we
>>> need to be able to say exactly what those benefits are and why we
>>> feel they will be compromised by leaving the Hadoop project.
>>>
>>> There may be other options that I haven't thought of. Please feel
>>> free to suggest any you think of.
>>>
>>> Questions? Thoughts? Let the discussion begin.
>>>
>>> Alan.