Re: Consider cleaning up backend code

2010-04-22 Thread Jianyong Dai
+1 for removing. This interface does not bring us any value now that we
have decided to move closer to Hadoop. Writing a backend is almost
writing half of Pig, so I don't think this interface is attractive to most
developers. Instead, I +1 Milind's idea to make intermediate
artifacts available, or to provide hooks for users to peek at and morph the
plan at different stages. This opens the door for developers to
visualize, debug, and improve Pig without knowing every detail of Pig.
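The hook idea described above could look something like the following sketch. This is purely hypothetical, not Pig's actual API: every name here (Stage, PlanListener, Compiler) is invented, and the "plans" are just strings standing in for real plan objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT Pig's actual API: a listener hook that exposes the
// plan after each compilation stage so a developer can dump, visualize, or
// inspect it. All names here are invented for illustration.
public class PlanHookSketch {

    enum Stage { LOGICAL, PHYSICAL, MAPREDUCE }

    interface PlanListener {
        // Receives a printable representation of the plan at the given stage.
        void onPlan(Stage stage, String planDump);
    }

    static class Compiler {
        private final List<PlanListener> listeners = new ArrayList<>();

        void register(PlanListener l) { listeners.add(l); }

        private void fire(Stage s, String dump) {
            for (PlanListener l : listeners) l.onPlan(s, dump);
        }

        // Toy "compilation": each stage rewrites the plan and fires the hook.
        String compile(String script) {
            String plan = "LOLoad->LOFilter";      // pretend logical plan
            fire(Stage.LOGICAL, plan);
            plan = plan.replace("LO", "PO");       // pretend physical plan
            fire(Stage.PHYSICAL, plan);
            plan = "MRJob[" + plan + "]";          // pretend map-reduce plan
            fire(Stage.MAPREDUCE, plan);
            return plan;
        }
    }

    public static void main(String[] args) {
        Compiler c = new Compiler();
        c.register((stage, dump) -> System.out.println(stage + ": " + dump));
        c.compile("A = LOAD 'x'; B = FILTER A BY $0 > 0;");
    }
}
```

A visualization or debugging tool would simply register a listener and never need to know how the stages themselves work.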


Daniel

Alan Gates wrote:
A couple of years ago we had this concept that Pig, as is, should be
able to run on other backends (like, say, Dryad if it were open
source). So we built this whole backend interface and (mostly) kept
Hadoop-specific objects out of the front end.


Recently we have revised that stance and said that this implementation
of Pig is Hadoop specific. Pig Latin itself will still stay Hadoop
independent, so the ability to have multiple backends is fine. But
the ability to have non-Hadoop backends is not really interesting now.


So I at least see the proposal here as getting rid of generic code  
that tries to hide the fact that we are working on top of Hadoop  
(things like DataStorage and ExecutionEngine).


Alan.

On Apr 22, 2010, at 4:14 PM, Arun C Murthy wrote:

  
I read it as getting rid of concepts parallel to Hadoop in
src/org/apache/pig/backend/hadoop/datastorage.


Is that true?

thanks,
Arun

On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote:


I kind of dig the concept of being able to plug in a different backend,
though I definitely think we should get rid of the dead localmode code.
Can you give an example of how this will simplify the codebase? Is it
more than just GenericClass foo = new SpecificClass(), and the
associated extra files?


-D
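The indirection Dmitriy asks about can be illustrated with a simplified sketch. The names below mirror Pig's ExecutionEngine-style abstraction but are inventions for illustration, not the project's real classes.

```java
// Illustrative only: the shape of indirection being discussed. These are
// simplified, invented stand-ins for Pig's backend abstraction layer.
public class BackendIndirection {

    interface ExecutionEngine {                 // the generic layer in question
        String execute(String plan);
    }

    // The only implementation left after PIG-1053: Hadoop.
    static class HExecutionEngine implements ExecutionEngine {
        public String execute(String plan) { return "hadoop:" + plan; }
    }

    public static void main(String[] args) {
        // With the abstraction, callers hold the interface...
        ExecutionEngine engine = new HExecutionEngine();
        System.out.println(engine.execute("plan1"));

        // ...without it, callers hold the concrete Hadoop class directly, and
        // the interface plus any factory/registry plumbing can be deleted.
        HExecutionEngine direct = new HExecutionEngine();
        System.out.println(direct.execute("plan2"));
    }
}
```

The simplification is thus less about the assignment itself and more about deleting the interface files, factories, and the generic-looking plumbing that hides the single concrete implementation.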

On Thu, Apr 22, 2010 at 1:25 PM, Arun C Murthy   
wrote:


  

+1

Arun


On Apr 22, 2010, at 11:35 AM, Richard Ding wrote:

Pig has an abstraction layer (interfaces and abstract classes) to
support multiple execution engines. After PIG-1053, Hadoop is the only
execution engine supported by Pig. I wonder if we should remove this
layer of code and make Hadoop THE execution engine for Pig. This would
simplify the backend code a lot.



Thanks,

-Richard




  


  




Re: Broken build

2010-03-15 Thread Jianyong Dai

Hi, Dmitriy,
I just did a fresh build and ran test-commit; I didn't see the problem.
Besides, org.apache.pig.experimental.logical.optimizer.PlanPrinter is in
the trunk. Can you double-check?


Daniel

Dmitriy Ryaboy wrote:

Hi guys,
Trunk has been broken for a while. A bunch of tests in the test-commit
target fail, mostly due to "The import
org.apache.pig.experimental.logical.optimizer.PlanPrinter cannot be
resolved." Could someone check in the missing file?

-D
  




Re: [VOTE] Branch for Pig 0.6.0 release

2009-11-10 Thread Jianyong Dai
+1. I think Jeff's patch for the file system commands (PIG-891) also
deserves some advertisement. Those commands are really handy for end
users.


Daniel

Alan Gates wrote:
+1.  In addition to the new features we've added, our change to use  
Hadoop's LineRecordReader brought Pig to parity with Hadoop in the  
PigMix tests, about a 30% average performance improvement.  This  
should be huge for our users.


Alan.

On Nov 9, 2009, at 12:26 PM, Olga Natkovich wrote:

  

Hi,



I would like to propose branching for the Pig 0.6.0 release with the
intent to have a release before the end of the year. We have done a lot
of work since branching for Pig 0.5.0 that we would like to share with
users. This includes changing how bags are spilled to disk (PIG-975,
PIG-1037), skewed and fragment-replicated outer join, plus many other
performance improvements and bug fixes.



Please vote by Thursday.



Thanks,



Olga








  




Re: [VOTE] Release Pig 0.4.0 (candidate 2)

2009-09-22 Thread Jianyong Dai
I removed ~/pigtest/conf/hadoop-site.xml and built piggybank again; all
tests pass. For some reason MiniCluster does not regenerate
hadoop-site.xml and reuses the old one, which happens to be wrong.


Olga Natkovich wrote:

Hi,

The new version is available in
http://people.apache.org/~olga/pig-0.4.0-candidate-2/.

I see one failure in a unit test in piggybank (contrib.) but it is not
related to the functions themselves but seems to be an issue with
MiniCluster and I don't feel we need to chase this down. I made sure
that the same test runs ok with Hadoop 20.

Please, vote by end of day on Thursday, 9/24.

Olga

-Original Message-
From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
Sent: Thursday, September 17, 2009 12:09 PM
To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
Subject: [VOTE] Release Pig 0.4.0 (candidate 1)

Hi,

I have fixed the issue causing the failure that Alan reported.

Please test the new release:
http://people.apache.org/~olga/pig-0.4.0-candidate-1/.

Vote closes on Tuesday, 9/22.

Olga


-Original Message-
From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
Sent: Monday, September 14, 2009 2:06 PM
To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
Subject: [VOTE] Release Pig 0.4.0 (candidate 0)

Hi,



I created a candidate build for Pig 0.4.0 release. The highlights of
this release are



-  Performance improvements, especially in the area of JOIN
support, where we introduced two new join types: skew join to deal with
data skew, and sort-merge join to take advantage of sorted data sets.

-  Support for outer join.

-  Works with Hadoop 18.



I ran the release audit, and the RAT report looked fine. The relevant
part is attached below.



Keys used to sign the release are available at
http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup.



Please download the release and try it out:
http://people.apache.org/~olga/pig-0.4.0-candidate-0.



Should we release this? Vote closes on Thursday, 9/17.



Olga





 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/user_comments_for_pig_0.3.1_to_pig_0.5.0-dev.xml
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_additions.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_all.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_changes.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_removals.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/changes-summary.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_additions.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_all.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_changes.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_removals.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_additions.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_all.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_changes.html
 [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/j

Re: Request for feedback: cost-based optimizer

2009-09-02 Thread Jianyong Dai
Yes, physical properties are important for an optimizer. To optimize Pig
well, we need to know the underlying Hadoop execution environment: the
number of map-reduce jobs, how many maps/reducers, how the job is
configured, etc. This is true even for a rule-based optimizer.
Unfortunately, the physical layer does not provide much physical
information, despite what the name suggests. Basically, the physical
layer is a rephrasing of the logical layer using physical operators.
Compared to logical operators, physical operators include the
implementation of pipeline processing but strip away many logical
details, such as the schema. Also, in the logical layer we have
infrastructure to restructure logical operators (move nodes around,
swap nodes, etc.), which does not exist in the physical layer. From the
optimizer's point of view, the physical layer does not give the
necessary information and is harder to deal with.

If you would like to work with physical details, I think the map-reduce
layer is the right place to look. However, restructuring the map-reduce
layer is hard because we do not have the infrastructure to move things
around. Another approach is to use a combined logical layer and
map-reduce layer for the optimization: you restructure the logical
layer by observing the physical details from the map-reduce layer. The
downside is that we have to tightly couple Pig to Hadoop. But now that
Pig is a subproject of Hadoop and almost all Pig users are using
Hadoop, I think it is fine to optimize towards Hadoop.
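The kind of plan-restructuring infrastructure Daniel mentions, moving and swapping nodes in the logical plan, can be sketched abstractly. This is illustrative only, using a plain list of operator names; it is not Pig's real OperatorPlan API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy sketch of plan restructuring: swapping two adjacent operators in a
// linear logical plan (e.g. pushing a filter above a sort). Illustrative only.
public class PlanSwapSketch {

    // Swap the operators at positions i and i+1 in a linear plan.
    static void swapAdjacent(List<String> plan, int i) {
        Collections.swap(plan, i, i + 1);
    }

    public static void main(String[] args) {
        List<String> plan =
            new ArrayList<>(Arrays.asList("LOLoad", "LOSort", "LOFilter"));
        // Filter rows before sorting so the sort handles less data.
        swapAdjacent(plan, 1);
        System.out.println(String.join(" -> ", plan));
        // prints: LOLoad -> LOFilter -> LOSort
    }
}
```

The point is that a rewrite like this only needs to know how operators connect, which the logical layer tracks; the physical and map-reduce layers would need equivalent bookkeeping before an optimizer could safely reorder them.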



Dmitriy Ryaboy wrote:

Our initial survey of related literature showed that the usual place
for a CBO tends to be between the physical and logical layer (in fact,
the famous Cascades paper advocates removing the distinction between
physical and logical operators altogether, and using an "is_logical"
and "is_physical" flag instead -- meaning an operator can be one,
both, or neither).
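The Cascades idea described above, one operator class with logical/physical flags rather than two parallel hierarchies, can be sketched as follows. The names are illustrative, not taken from Cascades or Pig.

```java
// Sketch of the Cascades flag idea: a single Operator class carrying
// isLogical/isPhysical flags, so an operator can be logical, physical, both,
// or neither. Illustrative names only.
public class CascadesOperatorSketch {

    static class Operator {
        final String name;
        final boolean isLogical;
        final boolean isPhysical;

        Operator(String name, boolean isLogical, boolean isPhysical) {
            this.name = name;
            this.isLogical = isLogical;
            this.isPhysical = isPhysical;
        }
    }

    public static void main(String[] args) {
        // "Join" is purely logical; "HashJoin" is one physical implementation;
        // a simple "Filter" might serve as both.
        Operator[] ops = {
            new Operator("Join", true, false),
            new Operator("HashJoin", false, true),
            new Operator("Filter", true, true),
        };
        for (Operator op : ops) {
            System.out.println(op.name + " logical=" + op.isLogical
                    + " physical=" + op.isPhysical);
        }
    }
}
```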

The reasoning is that you cannot properly determine a cost of a plan
if you don't know the physical "properties" of the operators that
implement it. An optimizer that works at a logical layer would by
definition create the same plan whether in local or mapreduce mode
(since such differences are abstracted from it). This is clearly
incorrect, as the properties of the environment in which these plans
are executed are drastically different.  Working at the physical layer
lets us stay close to the iron and adjust based on the specifics of
the execution environment.

Certainly one can posit a framework for a CBO that would set up the
necessary interfaces and plumbing for optimizing in any execution
mode, and invoke the proper implementations at run time; we are not
discounting that possibility (haven't gotten quite that far in the
design, to be honest).  But we feel that the implementations have to
be execution mode specific.

-Dmitriy

On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai wrote:
  

I am still reading, but one interesting question: why did you decide to
put the CBO in the physical layer?

Dmitriy Ryaboy wrote:


Whoops :-)
Here's the Google doc:

http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan
wrote:

  

Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal








Re: Request for feedback: cost-based optimizer

2009-09-01 Thread Jianyong Dai
I am still reading, but one interesting question: why did you decide to
put the CBO in the physical layer?


Dmitriy Ryaboy wrote:

Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en

-Dmitriy

On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote:
  

Dmitriy and Gang,

The mailing list does not allow attachments. Can you post it on a
website and just send the URL ?

Thanks,
Santhosh

-Original Message-
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer

Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.

We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.

Thanks,
Dmitriy, Ashutosh, Tejal