Ted,

The latest Pig release, 0.6.0 on Hadoop 0.20, is a clear winner, not just for performance but also for doing a better job of managing memory in its MR job pipeline. Support for both inner and outer skewed joins is something I found indispensable when dealing with really large datasets. Pig also supports streaming, which lets you stream a relation through an external Perl/Python/Ruby/... script. Support for UDFs in scripting languages is expected in the near future.
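
As a rough illustration of the streaming support, here is a minimal sketch of a Python script that a relation could be streamed through. It assumes Pig's default tab-delimited streaming format and a hypothetical (doc_id, text) schema; the script name shingle.py, the function names, and the shingle size k=3 are made up for the example and are not taken from any existing implementation.

```python
#!/usr/bin/env python
"""Hypothetical streaming script: turns (doc_id, text) tuples into word shingles."""
import sys


def shingles(text, k=3):
    """Return the distinct k-word shingles of text, in first-seen order."""
    words = text.lower().split()
    seen = set()
    result = []
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        if shingle not in seen:
            seen.add(shingle)
            result.append(shingle)
    return result


def process(lines, k=3):
    """Map 'doc_id<TAB>text' lines to 'doc_id<TAB>shingle' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank input lines
        # Assumption: the first field is the id, the rest is the text.
        doc_id, _, text = line.partition("\t")
        for shingle in shingles(text, k):
            yield "%s\t%s" % (doc_id, shingle)


if __name__ == "__main__":
    # One input tuple per line on stdin, one output tuple per line on stdout.
    for out in process(sys.stdin):
        print(out)
```

In a Pig script this would be wired in along the lines of B = STREAM A THROUGH `python shingle.py`; with the script shipped to the cluster nodes. The exact DEFINE/SHIP setup depends on the installation.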

About interfacing with other systems, I assume you have an RDBMS in mind. There is a patch (for Pig 0.7) that lets you write directly from Pig to an RDBMS like MySQL. Support for writing directly to HBase was always there and has been improved, I believe. With the 0.7 release, Pig's load/store functions rely on Hadoop's InputFormat/OutputFormat, so our vector format shouldn't be a problem, IMHO. The only thing I am concerned about is the "not too efficient" Tuple implementation in Pig, which does not give performance equivalent to Java MR.

Recently I implemented shingling in Pig and found it to work beautifully. One problem that I hit had to do with using clusters to generate recommendations, since some clusters were quite large (> 10 K). For this I needed to do a self-join and wanted the join load to be split evenly. That's where skewed join came to the rescue. Apart from this, I also want to contribute my implementation to Mahout (the reason for starting this thread :-))

-...@nkur

On 2/22/10 1:26 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

I have had both positive and negative results with PIG. The positive results were that I was able to express large recommendation computations in a very concise way. That was really helpful.

My negative results have been to do with the brittle nature of PIG vis a vis the version of the underlying hadoop system. That problem may have abated somewhat as everybody in the world except me and Amazon's EMR has pretty much piled up on version 20.

I also know little about how well Pig would interface with other components. I know that I have had difficulty in the past injecting outside information into Pig, but that has been improved. I also know that "Pigs eat anything", but have no clear idea how well this would play out with, say, our vector formats and vectorizers.

Ankur, what recent experience do you have? How well do PIG scripts play with other programs any more?

On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel <gan...@yahoo-inc.com> wrote:

> I had Sean's opinion on this and he was not too comfortable with the idea
> of having things in different languages in Mahout. However, given the
> benefits of PIG, I feel otherwise. I may be biased here due to my own
> experience of being able to do more in less time in Pig than in M/R, so I
> thought let me ask how folks feel.
>
> Ted, I believe you have some PIG experience yourself so any thoughts on
> this?
>

--
Ted Dunning, CTO
DeepDyve