[jira] Commented: (PIG-143) Proposal for refactoring of parsing logic in Pig

Santhosh Srinivasan (JIRA) Wed, 21 May 2008 16:52:21 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598866#action_12598866
 ]


Santhosh Srinivasan commented on PIG-143:
-----------------------------------------

Comments on Set 3 v2 patch:

   - Schema.java

   1. mFields is a pivate member variable of Schema. You will need to use the 
getFields() method
{code}
+        List<FieldSchema> mylist = schema.mFields ;
+        List<FieldSchema> otherlist = other.mFields ;
{code}


   2. Minor Comment: I prefer pre-increment to post-increment in loop 
iterations. In C++, pre-increment is faster as the compiler does not allocate a 
temporary register
{code}
+        int idx = 0;
+        for (; idx< iterateLimit ; idx ++) {
{code}

   3. Minor Comment: FieldSchema has a constructor that accepts the alias, 
schema and type of the field schema. This will preclude the need for a 
temporary variable.

{code}
+                    FieldSchema tmp = new FieldSchema(fs.alias, fs.schema) ;
+                    tmp.type = fs.type ;
+                    outputList.add(tmp) ;
{code}

{code}
+                    outputList.add(new FieldSchema(fs.alias, fs.schema, 
fs.type)) ;
{code}

   - TestBuiltin.java

   1. Two imports have been removed. These imports are used by test cases that 
are commented out for now but will be uncommented shortly.

{code}
-import org.apache.pig.impl.builtin.ShellBagEvalFunc;
-import org.apache.pig.impl.io.FileLocalizer;
{code}

   - test/org/apache/pig/test/TestTypeChecking.java

   1. The code in the helper section has changed with parser_chages_v11.patch. 
I have pasted the relevant changes below.

{code}
             if(roots.size() > 0) {
-                if (logicalOpTable.get(roots.get(0)) instanceof 
LogicalOperator){
-                    System.out.println(query);
-                    System.out.println(logicalOpTable.get(roots.get(0)));
+                for(LogicalOperator op: roots) {
+                    if (!(op instanceof LOLoad)){
+                        throw new Exception("Cannot have a root that is not 
the load operator LOLoad. Found " + op.getClass().getName());
+                    }
                 }
-                if ((roots.get(0)).getAlias()!=null){
-                    aliases.put((roots.get(0)).getAlias(), lp);
-                }


-    Map<String, LogicalPlan> aliases = new HashMap<String, LogicalPlan>();
+    Map<LogicalOperator, LogicalPlan> aliases = new HashMap<LogicalOperator, 
LogicalPlan>();

{code}

   - LOProject.java

   1. I don't see changes in getSchema and getType to use getFieldSchema. A 
call to getFieldSchema is required at the beginning of each method

   - validators/TypeCheckingVisitor.java 

   1. The schemas for LOProject are being recomputed. Instead you could 
probably use getSchema() and getFieldSchema() methods that are part of LOProject

{code}
+            // extracting schema from projection
+            List<FieldSchema> fsList = new ArrayList<FieldSchema>() ;
+            try {
+                for(int index: pj.getProjection()) {
+                    FieldSchema fs = null ;
+                    // typed input
+                    if (inputSchema != null) {
+                        fs = inputSchema.getField(index) ;
+                        FieldSchema newFs = new FieldSchema(fs.alias, 
fs.schema, fs.type) ;
+                        fsList.add(newFs) ;
+                    }
+                    // non-typed input
+                    else {
+                        FieldSchema newFs = new FieldSchema(null, 
DataType.BYTEARRAY) ;
+                        fsList.add(newFs) ;
+                    }
+                }
+                pj.setFieldSchema(new FieldSchema(null, new Schema(fsList), 
innerProject.getType()));
{code}

   2. There is an assumption that the innermost project (i.e., when the 
sentinel is true) can project only one column from its input. Is it true all 
the time? I tried to think of examples where it was violated and could not 
think of any :)


{code}
+        // if it's a sentinel, we just get the projected input type to it
+        else {
+            if (pj.getProjection().size() != 1) {
+                throw new AssertionError("Sentinel LOProject can have only "
+                                         + "1 projection") ;
+            }
{code}

   3. General Comment: Most of the operators are recomputing the schema, it 
would be easier if you just call getSchema() or getFieldSchema() as appropriate.



> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>
>                 Key: PIG-143
>                 URL: https://issues.apache.org/jira/browse/PIG-143
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Pi Song
>            Assignee: Pi Song
>         Attachments: ParserDrawing.png, pigtype_cycle_check.patch, 
> PostParseValidation_Set2.patch, PostParseValidation_Set2_v2.patch, 
> PostParseValidation_Set3.patch, PostParseValidation_Set3_v2.patch, 
> PostParseValidation_WorkingVersion_1.patch, 
> PostParseValidation_WorkingVersion_2.patch, SchemaMerge.patch
>
>
> h2. Pig Script Parser Refactor Proposal 
> This is my initial proposal on pig script parser refactor work. Please note 
> that I need your opinions for improvements.
> *Problem*
> The basic concept is around the fact that currently we do validation logics 
> in parsing stage (for example, file existence checking) which I think is not 
> clean and difficult to add new validation rules. In the future, we will need 
> to add more and more validation logics to improve usability.
> *My proposal:-*  (see [^ParserDrawing.png])
> - Only keep parsing logic in the parser and leave output of parsing logic 
> being unchecked logical plans. (Therefore the parser only does syntactic 
> checking)
> - Create a new class called LogicalPlanValidationManager which is responsible 
> for validations of the AST from the parser.
> - A new validation logic will be subclassing LogicalPlanValidator 
> - We can chain a set of LogicalPlanValidators inside 
> LogicalPlanValidationManager to do validation work. This allows a new 
> LogicalPlanValidator to be added easily like a plug-in. 
> - This greatly promotes modularity of the validation logics which  is 
> +particularly good when we have a lot of people working on different things+ 
> (eg. streaming may require a special validation logic)
> - We can set the execution order of validators
> - There might be some backend specific validations needed when we implement 
> new execution engines (For example a logical operation that one backend can 
> do but others can't).  We can plug-in this kind of validations on-the-fly 
> based on the backend in use.
> *List of LogicalPlanValidators extracted from the current parser logic:-*
> - File existence validator
> - Alias existence validator
> *Logics possibly be added in the very near future:-*
> - Streaming script test execution
> - Type checking + casting promotion + type inference
> - Untyped plan test execution
> - Logic to prevent reading and writing from/to the same file
> The common way to implement a LogicalPlanValidator will be based on Visitor 
> pattern. 
> *Cons:-*
>  - By having every validation logic traversing AST from the root node every 
> time, there is a performance hit. However I think this is neglectable due to 
> the fact that Pig is very expressive and normally queries aren't too big (99% 
> of queries contain no more than 1000 AST nodes).
> *Next Step:-*
> LogicalPlanFinalizer which is also a pipeline except that each stage can 
> modify the input AST. This component will generally do a kind of global 
> optimizations.
> *Further ideas:-*
> - Composite visitor can make validations more efficient in some cases but I 
> don't think we need
> - ASTs within the pipeline never change (read-only) so validations can be 
> done in parallel to improve responsiveness. But again I don't think we need 
> this unless we have so many I/O bound logics.
> - The same pipeline concept can also be applied in physical plan 
> validation/optimization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-143) Proposal for refactoring of parsing logic in Pig

Reply via email to