I'm starting to work on the part of the abstraction layer that concerns the execution engine, that is, the separation of the front-end from the back-end.
Based on previous discussions with various folks, including Trevor Strohman from the Galago project, I think it is possible to identify some requirements/changes, which I've summarized below (in addition to what is currently posted at http://wiki.apache.org/pig/PigAbstractionLayer). I would like to get some feedback on these points and on whether I have left out aspects that need to be considered as well.

Thanks,
-a.

Front-End:

Change logical plan representation: the goal is to change the representation of logical plans so that:

* details pertaining to the physical query plan execution are no longer present in the front-end;
* a new logical plan submitted to the back-end can reference a portion (or alias) of another logical plan.

Aspects affected by the changes above are:

1. Need to remove data collectors and the logic that manages data pipes from the eval specs and cond's of logical operators. These data structures are used in the local execution mode. We can add physical eval specs and cond's where data pipes and data collectors are set up. This has the disadvantage of creating extra code (similar to the code for logical eval specs and logical cond's), but the overall separation of the logical aspects from the physical execution should be much cleaner.

2. Need to remove the table of query results, where aliases are mapped to intermediate results. This data structure is populated when the logical plan is compiled. The concept of intermediate results does not seem to belong in the front-end. (Information about the generation of intermediate results will be maintained in the back-end.)

3. Extend the representation of logical operators by assigning each of them a scope and a unique id within that scope. The motivation is that new logical plans submitted to the back-end can reference previous logical plans (or parts of them) via a (scope id, node id) pair. Having the concept of scope can also provide support in the back-end for purging information about entities that go out of scope. For instance, the session id could be used as the scope, so that the back-end can garbage collect entities that are no longer needed.

4. Need to add a catalog that maps aliases to logical trees. For instance, when a store operation is encountered, the front-end can determine the set of dependent logical trees to serialize and send to the back-end, or the (scope, id) pairs of previous plans to reference.

5. The serialization process from the front-end to the back-end can produce a representation of the logical plan and its dependencies that includes the (scope, id) of each operator to send to the back-end.

Back-End:

1. The back-end would maintain the table of intermediate results.
2. Compilation of the logical plan to a physical plan would take place in the back-end.
3. A local back-end would generate physical trees using the physical eval specs and physical cond's (as described above).
4. A Hadoop back-end would compile the logical plan to map/reduce.
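
To make front-end points 3 to 5 more concrete, here is a rough sketch of how an alias catalog and (scope, id) keys could fit together. It is only an illustration under my own assumptions: the names PlanSketch, OperatorKey, LogicalOperator, Catalog, define and planForStore are all hypothetical and are not existing Pig classes or methods, and the "already sent" bookkeeping is just one possible way for the front-end to decide what to serialize versus what to reference.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Pig.
final class PlanSketch {
    /** (scope, id) pair identifying one logical operator. */
    static final class OperatorKey {
        final String scope;
        final long id;
        OperatorKey(String scope, long id) { this.scope = scope; this.id = id; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof OperatorKey)) return false;
            OperatorKey other = (OperatorKey) o;
            return scope.equals(other.scope) && id == other.id;
        }
        @Override public int hashCode() { return Objects.hash(scope, id); }
        @Override public String toString() { return "(" + scope + ", " + id + ")"; }
    }

    /** Logical operator: a name plus the keys of its inputs, no data pipes or collectors. */
    static final class LogicalOperator {
        final OperatorKey key;
        final String name;
        final List<OperatorKey> inputs;
        LogicalOperator(OperatorKey key, String name, List<OperatorKey> inputs) {
            this.key = key;
            this.name = name;
            this.inputs = inputs;
        }
    }

    /** Front-end catalog mapping aliases to logical trees within one scope (e.g. a session). */
    static final class Catalog {
        private final String scope;
        private long nextId = 0;
        private final Map<OperatorKey, LogicalOperator> operators = new HashMap<>();
        private final Map<String, OperatorKey> aliases = new HashMap<>();
        private final Set<OperatorKey> alreadySent = new HashSet<>();

        Catalog(String scope) { this.scope = scope; }

        /** Registers a new operator under an alias; its inputs are named by alias. */
        OperatorKey define(String alias, String name, String... inputAliases) {
            List<OperatorKey> inputs = new ArrayList<>();
            for (String input : inputAliases) {
                OperatorKey k = aliases.get(input);
                if (k == null) throw new IllegalArgumentException("unknown alias: " + input);
                inputs.add(k);
            }
            OperatorKey key = new OperatorKey(scope, nextId++);
            operators.put(key, new LogicalOperator(key, name, inputs));
            aliases.put(alias, key);
            return key;
        }

        /**
         * Called when a store on the alias is encountered. Returns the operators to
         * serialize, inputs before consumers. Operators sent by an earlier call are
         * left out; they are only referenced through the (scope, id) keys in inputs.
         */
        List<LogicalOperator> planForStore(String alias) {
            OperatorKey root = aliases.get(alias);
            if (root == null) throw new IllegalArgumentException("unknown alias: " + alias);
            Map<OperatorKey, LogicalOperator> out = new LinkedHashMap<>();
            collect(root, out);
            alreadySent.addAll(out.keySet());
            return new ArrayList<>(out.values());
        }

        private void collect(OperatorKey key, Map<OperatorKey, LogicalOperator> out) {
            if (alreadySent.contains(key) || out.containsKey(key)) return;
            LogicalOperator op = operators.get(key);
            for (OperatorKey input : op.inputs) collect(input, out);
            out.put(key, op);
        }
    }

    public static void main(String[] args) {
        Catalog catalog = new Catalog("session-1");
        catalog.define("A", "load");
        catalog.define("B", "filter", "A");
        catalog.define("C", "group", "B");
        System.out.println(catalog.planForStore("B").size()); // prints 2: A and B are serialized
        System.out.println(catalog.planForStore("C").size()); // prints 1: only C, which references B by key
    }
}
```

In this sketch the second store sends only the new operator, and its input list carries the (scope, id) of the operator the back-end already holds. Using the session id as the scope would then let the back-end drop everything keyed by that scope once the session ends.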
