[
https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598866#action_12598866
]
Santhosh Srinivasan commented on PIG-143:
-----------------------------------------
Comments on Set 3 v2 patch:
- Schema.java
1. mFields is a pivate member variable of Schema. You will need to use the
getFields() method
{code}
+ List<FieldSchema> mylist = schema.mFields ;
+ List<FieldSchema> otherlist = other.mFields ;
{code}
2. Minor Comment: I prefer pre-increment to post-increment in loop
iterations. In C++, pre-increment is faster as the compiler does not allocate a
temporary register
{code}
+ int idx = 0;
+ for (; idx< iterateLimit ; idx ++) {
{code}
3. Minor Comment: FieldSchema has a constructor that accepts the alias,
schema and type of the field schema. This will preclude the need for a
temporary variable.
{code}
+ FieldSchema tmp = new FieldSchema(fs.alias, fs.schema) ;
+ tmp.type = fs.type ;
+ outputList.add(tmp) ;
{code}
{code}
+ outputList.add(new FieldSchema(fs.alias, fs.schema,
fs.type)) ;
{code}
- TestBuiltin.java
1. Two imports have been removed. These imports are used by test cases that
are commented out for now but will be uncommented shortly.
{code}
-import org.apache.pig.impl.builtin.ShellBagEvalFunc;
-import org.apache.pig.impl.io.FileLocalizer;
{code}
- test/org/apache/pig/test/TestTypeChecking.java
1. The code in the helper section has changed with parser_chages_v11.patch.
I have pasted the relevant changes below.
{code}
if(roots.size() > 0) {
- if (logicalOpTable.get(roots.get(0)) instanceof
LogicalOperator){
- System.out.println(query);
- System.out.println(logicalOpTable.get(roots.get(0)));
+ for(LogicalOperator op: roots) {
+ if (!(op instanceof LOLoad)){
+ throw new Exception("Cannot have a root that is not
the load operator LOLoad. Found " + op.getClass().getName());
+ }
}
- if ((roots.get(0)).getAlias()!=null){
- aliases.put((roots.get(0)).getAlias(), lp);
- }
- Map<String, LogicalPlan> aliases = new HashMap<String, LogicalPlan>();
+ Map<LogicalOperator, LogicalPlan> aliases = new HashMap<LogicalOperator,
LogicalPlan>();
{code}
- LOProject.java
1. I don't see changes in getSchema and getType to use getFieldSchema. A
call to getFieldSchema is required at the beginning of each method
- validators/TypeCheckingVisitor.java
1. The schemas for LOProject are being recomputed. Instead you could
probably use getSchema() and getFieldSchema() methods that are part of LOProject
{code}
+ // extracting schema from projection
+ List<FieldSchema> fsList = new ArrayList<FieldSchema>() ;
+ try {
+ for(int index: pj.getProjection()) {
+ FieldSchema fs = null ;
+ // typed input
+ if (inputSchema != null) {
+ fs = inputSchema.getField(index) ;
+ FieldSchema newFs = new FieldSchema(fs.alias,
fs.schema, fs.type) ;
+ fsList.add(newFs) ;
+ }
+ // non-typed input
+ else {
+ FieldSchema newFs = new FieldSchema(null,
DataType.BYTEARRAY) ;
+ fsList.add(newFs) ;
+ }
+ }
+ pj.setFieldSchema(new FieldSchema(null, new Schema(fsList),
innerProject.getType()));
{code}
2. There is an assumption that the innermost project (i.e., when the
sentinel is true) can project only one column from its input. Is it true all
the time? I tried to think of examples where it was violated and could not
think of any :)
{code}
+ // if it's a sentinel, we just get the projected input type to it
+ else {
+ if (pj.getProjection().size() != 1) {
+ throw new AssertionError("Sentinel LOProject can have only "
+ + "1 projection") ;
+ }
{code}
3. General Comment: Most of the operators are recomputing the schema, it
would be easier if you just call getSchema() or getFieldSchema() as appropriate.
> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>
> Key: PIG-143
> URL: https://issues.apache.org/jira/browse/PIG-143
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Reporter: Pi Song
> Assignee: Pi Song
> Attachments: ParserDrawing.png, pigtype_cycle_check.patch,
> PostParseValidation_Set2.patch, PostParseValidation_Set2_v2.patch,
> PostParseValidation_Set3.patch, PostParseValidation_Set3_v2.patch,
> PostParseValidation_WorkingVersion_1.patch,
> PostParseValidation_WorkingVersion_2.patch, SchemaMerge.patch
>
>
> h2. Pig Script Parser Refactor Proposal
> This is my initial proposal on pig script parser refactor work. Please note
> that I need your opinions for improvements.
> *Problem*
> The basic concept is around the fact that currently we do validation logics
> in parsing stage (for example, file existence checking) which I think is not
> clean and difficult to add new validation rules. In the future, we will need
> to add more and more validation logics to improve usability.
> *My proposal:-* (see [^ParserDrawing.png])
> - Only keep parsing logic in the parser and leave output of parsing logic
> being unchecked logical plans. (Therefore the parser only does syntactic
> checking)
> - Create a new class called LogicalPlanValidationManager which is responsible
> for validations of the AST from the parser.
> - A new validation logic will be subclassing LogicalPlanValidator
> - We can chain a set of LogicalPlanValidators inside
> LogicalPlanValidationManager to do validation work. This allows a new
> LogicalPlanValidator to be added easily like a plug-in.
> - This greatly promotes modularity of the validation logics which is
> +particularly good when we have a lot of people working on different things+
> (eg. streaming may require a special validation logic)
> - We can set the execution order of validators
> - There might be some backend specific validations needed when we implement
> new execution engines (For example a logical operation that one backend can
> do but others can't). We can plug-in this kind of validations on-the-fly
> based on the backend in use.
> *List of LogicalPlanValidators extracted from the current parser logic:-*
> - File existence validator
> - Alias existence validator
> *Logics possibly be added in the very near future:-*
> - Streaming script test execution
> - Type checking + casting promotion + type inference
> - Untyped plan test execution
> - Logic to prevent reading and writing from/to the same file
> The common way to implement a LogicalPlanValidator will be based on Visitor
> pattern.
> *Cons:-*
> - By having every validation logic traversing AST from the root node every
> time, there is a performance hit. However I think this is neglectable due to
> the fact that Pig is very expressive and normally queries aren't too big (99%
> of queries contain no more than 1000 AST nodes).
> *Next Step:-*
> LogicalPlanFinalizer which is also a pipeline except that each stage can
> modify the input AST. This component will generally do a kind of global
> optimizations.
> *Further ideas:-*
> - Composite visitor can make validations more efficient in some cases but I
> don't think we need
> - ASTs within the pipeline never change (read-only) so validations can be
> done in parallel to improve responsiveness. But again I don't think we need
> this unless we have so many I/O bound logics.
> - The same pipeline concept can also be applied in physical plan
> validation/optimization.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.