I'm starting to work on the part of the abstraction layer that covers
the execution engine, i.e. the separation of the front-end from the
back-end.

 

Based on previous discussions with various folks, including Trevor
Strohman from the Galago project, I think it is possible to identify
some requirements/changes, which I've summarized below (in addition to
what is currently posted at http://wiki.apache.org/pig/PigAbstractionLayer).

 

I would like to get some feedback on these points and whether I have
left out aspects that'd need to be considered as well.

 

Thanks,

-a.

 

 

Front-End:

Change the logical plan representation. The goal is a representation of
logical plans in which:

*         details pertaining to the physical query plan execution are
not present anymore in the front-end; 

*         a new logical plan submitted to the back-end can reference a
portion (or alias) of another logical plan

 

Aspects affected by the changes above are:

1.      need to remove the data collectors and the logic that manages
data pipes from the eval specs and cond's of logical operators. These
data structures are used in the local execution mode. We can add
physical eval specs and cond's where data pipes and data collectors are
set up. This has the disadvantage of creating extra code (similar to the
code for logical eval specs and logical cond's), but the overall
separation of the logical aspects from the physical execution should be
much cleaner.
2.      need to remove the table of query results, where aliases are
mapped to intermediate results. This data structure is populated when
the logical plan is compiled. The concept of intermediate results does
not seem to belong in the front-end. (Information about the generation
of intermediate results will be maintained in the back-end)
3.      extend the representation of logical operators by assigning each
one a scope and an id that is unique within that scope. The motivation
is that new logical plans submitted to the back-end can then reference
previous logical plans (or parts of them) via a (scope id, node id)
pair. Having the concept of scope also lets the back-end purge
information about entities that go out of scope. For instance, the
session id could be used as the scope, so that entities in the back-end
that are no longer needed can be garbage collected.
4.      need to add a catalog that maps aliases to logical trees. For
instance, when a store operation is encountered, the front-end can
determine the set of dependent logical trees to serialize and send to
the back-end, or the (scope, id) pairs of previous plans to reference.
5.      the serialization process from the front-end to the back-end can
produce a representation of the logical plan and its dependencies that
includes the (scope, id) pair of each operator sent to the back-end.
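
To make points 3 and 4 a bit more concrete, here is a rough sketch in
Java of what a (scope, id) key and an alias catalog might look like.
All class and method names (OperatorKey, KeyGenerator, AliasCatalog) are
made up for illustration and are not taken from the current code; the
catalog maps an alias to the key of the root operator only, which is an
assumption rather than a settled design.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical names throughout; a sketch, not the existing Pig classes.

/** Identifies a logical operator by a (scope id, node id) pair. */
final class OperatorKey {
    final String scope;
    final long id;

    OperatorKey(String scope, long id) {
        this.scope = Objects.requireNonNull(scope);
        this.id = id;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof OperatorKey)) return false;
        OperatorKey k = (OperatorKey) o;
        return id == k.id && scope.equals(k.scope);
    }

    @Override public int hashCode() { return Objects.hash(scope, id); }

    @Override public String toString() { return scope + "-" + id; }
}

/** Hands out ids that are unique within one scope (e.g. one session). */
final class KeyGenerator {
    private final String scope;
    private long next = 0;

    KeyGenerator(String scope) { this.scope = scope; }

    OperatorKey nextKey() { return new OperatorKey(scope, next++); }
}

/** Front-end catalog: alias -> key of the root of its logical tree. */
final class AliasCatalog {
    private final Map<String, OperatorKey> roots = new LinkedHashMap<>();

    void define(String alias, OperatorKey root) { roots.put(alias, root); }

    /** Returns null when the alias has not been defined. */
    OperatorKey lookup(String alias) { return roots.get(alias); }
}
```

With something like this, a later plan could refer to an alias from an
earlier plan by sending only the key returned by lookup(), instead of
re-serializing the whole tree.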

 

Back-End:

1.      the back-end would maintain the table of intermediate results
2.      compilation of the logical plan to a physical plan would take
place in the back-end
3.      a local back-end would generate physical trees using the
physical eval specs and physical cond's (as described above)
4.      a Hadoop back-end would compile the logical plan to map/reduce
 

 
