Re: Algorithm implementations in Pig

Ankur C. Goel Mon, 22 Feb 2010 00:44:29 -0800

Ted,
     The latest Pig release, 0.6.0 on Hadoop 0.20, is a clear winner, not just
for performance but also for managing memory better in its MR job pipeline.
Support for both inner and outer skewed joins is something I found
indispensable when dealing with really large datasets. Pig also supports
streaming, which lets you stream a relation through an external
perl/python/ruby... script. Support for UDFs in scripting languages is
expected in the near future.
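
As a rough illustration of the streaming point: as far as I know, Pig's STREAM 
... THROUGH operator by default hands each tuple to the external program as a 
tab-delimited line on stdin and reads tab-delimited lines back from stdout. A 
minimal Python sketch of such a script is below. The two-field (user, item) 
layout and the names normalize_line and stream are made up for the example.

```python
import sys


def normalize_line(line):
    """Turn one tab-delimited input line into one tab-delimited output line.

    Hypothetical layout: field 0 is a user id, field 1 is an item name.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # skip malformed rows instead of failing the whole task
    user, item = fields[0], fields[1]
    # Example transformation only: trim and lower-case the item name.
    return "\t".join([user, item.strip().lower()])


def stream(lines, out):
    """Apply normalize_line to every input line and write the survivors."""
    for line in lines:
        result = normalize_line(line)
        if result is not None:
            out.write(result + "\n")
```

To use it as the streamed script, end the file with 
stream(sys.stdin, sys.stdout).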

About interfacing with other systems, I assume you have an RDBMS in mind. There
is a patch (for Pig 0.7) that lets you write directly from Pig to an RDBMS like
MySQL. Support for writing directly to HBase was always there and has been
improved, I believe. With the 0.7 release, Pig has decided to let its
load/store functions rely on Hadoop's input/output formats, so our vector
format shouldn't be a problem IMHO. The only thing I am concerned about is the
"not too efficient" Tuple implementation in Pig, which does not give
performance equivalent to Java MR.

Recently I implemented shingling in Pig and found it to work beautifully. One
problem that I hit had to do with using clusters to generate recommendations,
since some clusters were quite large (> 10 K). For this I needed to do a
self-join and wanted the join load to be split evenly. That's where skewed join
came to the rescue.
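
For anyone unfamiliar with the term, here is a small sketch of shingling 
outside Pig. It assumes "shingling" means the usual w-shingle construction 
(the set of all contiguous runs of w tokens), compared using Jaccard 
similarity. It is not the Pig implementation mentioned above, and the function 
names and w=2 are arbitrary choices for illustration.

```python
def shingles(tokens, w=2):
    """Return the set of contiguous w-token runs (w-shingles) in tokens."""
    if w <= 0:
        raise ValueError("w must be positive")
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; taken to be 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two short token sequences that share some of their 2-shingles.
s1 = shingles("a rose is a rose".split())
s2 = shingles("a rose is red".split())
print(jaccard(s1, s2))
```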

Apart from this I also want to contribute my implementation to Mahout (the 
reason for starting this thread :-))

-...@nkur

On 2/22/10 1:26 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

I have had both positive and negative results with PIG.

The positive results were that I was able to express large recommendation
computations in a very concise way.  That was really helpful.

My negative results have had to do with the brittle nature of PIG vis-a-vis
the version of the underlying hadoop system.  That problem may have abated
somewhat as everybody in the world except me and Amazon's EMR has pretty
much piled onto version 20.  I also know little about how well Pig would
interface with other components.  I know that I have had difficulty in
the past injecting outside information into Pig, but that has been
improved.  I also know that "Pigs eat anything", but have no clear idea how
well this would play out with, say, our vector formats and vectorizers.

Ankur, what recent experience do you have?  How well do PIG scripts play
with other programs any more?

On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel <gan...@yahoo-inc.com> wrote:

> I had Sean's opinion on this and he was not too comfortable with the idea
> of having things in different languages in Mahout. However, given the
> benefits of PIG, I feel otherwise. I may be biased here due to my own
> experience of being able to do more in less time in Pig than in M/R, so I
> thought I would ask how folks feel.
>
> Ted, I believe you have some PIG experience yourself, so any thoughts on
> this?
>

--
Ted Dunning, CTO
DeepDyve
