[ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124096#comment-16124096 ]

ASF GitHub Bot commented on DRILL-5546:
---------------------------------------

GitHub user jinfengni opened a pull request:

    https://github.com/apache/drill/pull/906

    DRILL-5546: Handle schema change exception failure caused by empty input or empty batches.
    
    1. Modify ScanBatch's logic when it iterates over its list of RecordReaders.
       1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new
          schema (detected by calling Mutator.isNewSchema()) means a new top-level field
          was added, a field was added inside a nested field, or an existing field's
          type was changed.
       2) Implicit columns are added and populated only when the input is not empty,
          i.e. the batch contains > 0 rows, or rowCount == 0 but with a new schema.
       3) ScanBatch returns NONE directly (a "fast NONE") if all of its RecordReaders
          have empty input and are therefore skipped, instead of returning
          OK_NEW_SCHEMA first.
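
    As a rough illustration of the skip rule and the fast NONE (the SimpleReader
    interface and readAny method below are made-up placeholders, not the actual
    ScanBatch/RecordReader code):

        import java.util.List;

        class EmptyInputSkipSketch {
            interface SimpleReader {
                int next();              // rows read into the current batch
                boolean isNewSchema();   // stand-in for Mutator.isNewSchema()
            }

            // Returns true once a reader yields rows or a new schema; false means every
            // reader was empty and skipped, so the scan should emit a "fast NONE".
            static boolean readAny(List<SimpleReader> readers) {
                for (SimpleReader reader : readers) {
                    int rowCount = reader.next();
                    boolean newSchema = reader.isNewSchema();
                    if (rowCount == 0 && !newSchema) {
                        continue;   // skip: empty input with an unchanged schema
                    }
                    // Implicit columns would be added and populated only here,
                    // i.e. when rowCount > 0 or (rowCount == 0 && newSchema).
                    return true;
                }
                return false;       // all readers empty: fast NONE, no OK_NEW_SCHEMA
            }
        }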
    
    2. Modify IteratorValidatorBatchIterator to allow
       1) a fast NONE (i.e. a NONE before any OK_NEW_SCHEMA has been seen), and
       2) a batch with an empty list of columns.
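
    Illustrative sketch of the relaxed validation (the Outcome enum and validate()
    method are simplified placeholders, not the real IteratorValidatorBatchIterator
    code):

        class ValidatorSketch {
            enum Outcome { OK, OK_NEW_SCHEMA, NONE }

            private boolean sawSchema = false;

            void validate(Outcome outcome) {
                switch (outcome) {
                    case OK_NEW_SCHEMA:
                        sawSchema = true;
                        break;
                    case OK:
                        if (!sawSchema) {
                            throw new IllegalStateException("OK before any OK_NEW_SCHEMA");
                        }
                        break;
                    case NONE:
                        // A NONE before any OK_NEW_SCHEMA ("fast NONE") is now accepted,
                        // as is a batch whose column list is empty.
                        break;
                }
            }
        }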
    
    3. Modify JsonRecordReader for the case where it gets 0 rows: do not insert a
       nullable-int column for 0-row input. Together with the ScanBatch change, Drill
       will skip empty JSON files.
    
    4. Modify binary operators such as join and union to handle a fast NONE from either
       one side or both sides. The shared logic is abstracted into
       AbstractBinaryRecordBatch, except for MergeJoin, whose implementation is quite
       different from the others.
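
    A loose sketch of the shared prefetch idea, under the assumption that a join-like
    operator can stop early when one side is empty while a union-like operator keeps
    going (the names Side, Outcome and prefetchFirstBatches are invented; the real
    AbstractBinaryRecordBatch differs):

        class BinaryPrefetchSketch {
            enum Outcome { OK_NEW_SCHEMA, NONE }

            interface Side { Outcome next(); }

            // Returns false when the binary operator should itself emit a fast NONE.
            static boolean prefetchFirstBatches(Side left, Side right, boolean needBothSides) {
                Outcome leftOutcome = left.next();
                Outcome rightOutcome = right.next();
                if (leftOutcome == Outcome.NONE && rightOutcome == Outcome.NONE) {
                    return false;   // both inputs empty: fast NONE
                }
                if (needBothSides && (leftOutcome == Outcome.NONE || rightOutcome == Outcome.NONE)) {
                    return false;   // e.g. an inner join with one empty side
                }
                return true;        // e.g. union continues with the non-empty side
            }
        }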
    
    5. Fix and refactor the union-all operator.
      1) Correct the union operator's handling of 0 input rows. Previously it would
         ignore an input with 0 rows and put a nullable-int column into the output
         schema, which caused various schema change issues in downstream operators. The
         new behavior is to take a 0-row input's schema into account when determining
         the output schema, in the same way as inputs with > 0 rows. By doing that, we
         ensure the Union operator does not behave like a schema-lossy operator.
      2) Add a UnionInputIterator to simplify the logic that iterates over the
         left/right inputs, removing a significant chunk of duplicated code from the
         previous implementation.
      The new union-all operator is roughly half the code size of the old one.
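
    A minimal sketch of the output-schema rule in 1) above, assuming a placeholder
    Field type rather than Drill's actual schema classes:

        import java.util.ArrayList;
        import java.util.List;

        class UnionSchemaSketch {
            record Field(String name, String type) {}

            // Both sides contribute to the output schema even when one side has 0 rows;
            // previously a 0-row side was ignored and surfaced downstream as nullable int.
            static List<Field> outputSchema(List<Field> left, long leftRows,
                                            List<Field> right, long rightRows) {
                List<Field> out = new ArrayList<>();   // row counts deliberately unused
                for (int i = 0; i < left.size(); i++) {
                    Field l = left.get(i);
                    Field r = right.get(i);
                    // The real operator would promote to a common type here.
                    String type = l.type().equals(r.type())
                        ? l.type()
                        : "common type of (" + l.type() + ", " + r.type() + ")";
                    out.add(new Field(l.name(), type));
                }
                return out;
            }
        }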
    
    6. Introduce UntypedNullVector to handle the convertFromJSON() function when the
       input batch contains 0 rows.
      Problem: convertFromJSON() differs from other regular functions in that its output
      schema is known only after evaluation is performed. When the input has 0 rows,
      Drill essentially has no way to know the output type, and previously assumed Map
      type. That worked under the assumption that other operators such as Union would
      ignore a 0-row batch, which is no longer the case in the current implementation.
      Solution: Use MinorType.NULL as the output type of convertFromJSON() when the
      input contains 0 rows. The new UntypedNullVector is used to represent a column of
      MinorType.NULL.
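
    A hedged sketch of the type-resolution rule (the enum is trimmed down and
    outputType() is an invented helper, not Drill's actual code path):

        class ConvertFromJsonTypeSketch {
            enum MinorType { NULL, MAP }

            static MinorType outputType(long incomingRowCount, MinorType evaluatedType) {
                if (incomingRowCount == 0) {
                    // Nothing was evaluated, so the output type is unknown: use NULL
                    // (backed by UntypedNullVector) instead of assuming MAP.
                    return MinorType.NULL;
                }
                return evaluatedType;   // known only after evaluating the JSON input
            }
        }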
    
    7. Make HBaseGroupScan convert the star column into a list of row_key plus the
       column families. HBaseRecordReader should reject a star column, since it expects
       the star to have been converted elsewhere.
      In HBase, a column family always has map type and a non-rowkey column always has
      nullable varbinary type; this ensures that HBaseRecordReaders across different
      HBase regions produce the same top-level schema, even if a region is empty or all
      of its rows are pruned by filter pushdown optimization. In other words, we will
      not see different top-level schemas from different HBaseRecordReaders for the
      same table.
      However, this change cannot handle a hard schema change: c1 exists in cf1 in one
      region but not in another. Further work is required to handle hard schema changes.
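
    To illustrate the star expansion (expandStar() and the string column names are
    placeholders, not the actual HBaseGroupScan API):

        import java.util.ArrayList;
        import java.util.List;

        class HBaseStarExpansionSketch {
            // '*' becomes the row key plus one map-typed column per column family, so
            // every region reports the same top-level schema even when it has no rows.
            static List<String> expandStar(List<String> columnFamilies) {
                List<String> columns = new ArrayList<>();
                columns.add("row_key");
                columns.addAll(columnFamilies);   // each family surfaces as a map column
                return columns;
            }
        }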
    
    8. Modify scan cost estimation when the query involves the * column. This removes
       planning randomness, since previously two different operators could end up with
       the same cost.
    
    9. Add a new flag 'outputProj' to the Project operator, to indicate whether the
       Project produces the query's final output. Such a Project is added by
       TopProjectVisitor to handle a fast NONE when all inputs to the query are empty
       and are skipped:
      1) a star column is replaced with an empty list;
      2) a regular column reference is replaced with a nullable-int column;
      3) an expression goes through ExpressionTreeMaterializer, and the type of the
         materialized expression is used as the output type;
      4) the operator returns OK_NEW_SCHEMA with the schema built as above, then
         returns NONE to the downstream operator.
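
    A simplified sketch of this output Project's behavior when every input is empty
    (the Expr/Field types and the materialize() helper are hypothetical stand-ins):

        import java.util.ArrayList;
        import java.util.List;

        class OutputProjectFastNoneSketch {
            enum Outcome { OK_NEW_SCHEMA, NONE }

            record Expr(String name, boolean isStar, boolean isSimpleColumnRef) {}
            record Field(String name, String type) {}

            private boolean schemaSent = false;

            List<Field> buildSchema(List<Expr> projectList) {
                List<Field> fields = new ArrayList<>();
                for (Expr e : projectList) {
                    if (e.isStar()) continue;                            // 1) star -> empty list
                    if (e.isSimpleColumnRef()) {
                        fields.add(new Field(e.name(), "nullable int")); // 2) column ref
                    } else {
                        fields.add(new Field(e.name(), materialize(e))); // 3) materialized type
                    }
                }
                return fields;
            }

            Outcome next(List<Expr> projectList) {
                if (!schemaSent) {
                    buildSchema(projectList);   // 4) first OK_NEW_SCHEMA carries this schema,
                    schemaSent = true;
                    return Outcome.OK_NEW_SCHEMA;
                }
                return Outcome.NONE;            //    then NONE goes downstream
            }

            private String materialize(Expr e) {
                return "type of materialized " + e.name();   // placeholder for ExpressionTreeMaterializer
            }
        }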
    
    10. Add unit tests that exercise operators on empty input.
    
    11. Add unit tests for queries whose inputs are all empty.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinfengni/incubator-drill DRILL-5546

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/906.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #906
    
----
commit b0110140f8375809af3deddf9881e64dc1242886
Author: Jinfeng Ni <j...@apache.org>
Date:   2017-05-17T23:08:00Z

    DRILL-5546: Handle schema change exception failure caused by empty input or empty batches.
    
    (Commit message body is identical to the pull request description above.)

----


> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failures caused by
> empty batches. This JIRA is opened as an umbrella for all of those related JIRAs
> (such as DRILL-4686, DRILL-4734, DRILL-4476, DRILL-4255, etc.).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
