Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/906#discussion_r135382803
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java ---
    @@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
         }
       }
     
    -  private void releaseAssets() {
    -    container.zeroVectors();
    -  }
    -
    -  private void clearFieldVectorMap() {
    -    for (final ValueVector v : mutator.fieldVectorMap().values()) {
    -      v.clear();
    -    }
    -  }
    -
       @Override
       public IterOutcome next() {
         if (done) {
           return IterOutcome.NONE;
         }
         oContext.getStats().startProcessing();
         try {
    -      try {
     -        injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    -
    -        currentReader.allocate(mutator.fieldVectorMap());
    -      } catch (OutOfMemoryException e) {
    -        clearFieldVectorMap();
    -        throw UserException.memoryError(e).build(logger);
    -      }
    -      while ((recordCount = currentReader.next()) == 0) {
    +      while (true) {
             try {
    -          if (!readers.hasNext()) {
    -            // We're on the last reader, and it has no (more) rows.
    -            currentReader.close();
    -            releaseAssets();
    -            done = true;  // have any future call to next() return NONE
    -
    -            if (mutator.isNewSchema()) {
     -              // This last reader has a new schema (e.g., we have a zero-row
     -              // file or other source).  (Note that some sources have a non-
     -              // null/non-trivial schema even when there are no rows.)
     +          injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    --- End diff ---
    
    This patch tries to decouple the logic of the record reader from that of ScanBatch:
     - The record reader is responsible for adding vectors to the batch (via Mutator) and for populating the data.
     - ScanBatch is responsible for interpreting the output of the record reader, checking rowCount and Mutator.isNewSchema() to decide whether to return OK_NEW_SCHEMA, OK, or NONE (see the sketch after this list).
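
    For concreteness, here is an abbreviated sketch of that division of labor, paraphrasing (not quoting) Drill's RecordReader and OutputMutator interfaces:

    ```java
    // Reader side: declares vectors through the mutator and fills them with data.
    interface RecordReader extends AutoCloseable {
      void setup(OperatorContext context, OutputMutator output) throws ExecutionSetupException;
      void allocate(Map<String, ValueVector> vectorMap) throws OutOfMemoryException;
      int next();   // populate the vectors and return the row count
    }

    // Mutator side: the reader adds fields here; ScanBatch then asks whether
    // the schema changed since it last checked.
    interface OutputMutator {
      <T extends ValueVector> T addField(MaterializedField field, Class<T> clazz)
          throws SchemaChangeException;
      boolean isNewSchema();
    }
    ```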
    
    > What happens on the first reader? There is no schema, so any schema is a new schema. Suppose the file is JSON and the schema is built on the fly. Does the code handle the case that we have no schema (first reader), and that reader adds no columns?
    
    It's not true that "any schema is a new schema". If the first reader has no schema and adds no columns, then Mutator.isNewSchema() should return false. Mutator.isNewSchema() returns true only if, since the last call, one or more of the following happened (sketched in code after this list):
    
    - a new top-level field was added,
    - a field was added inside a nested field,
    - an existing field's type was changed.
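
    A minimal sketch of those semantics, assuming a plain boolean flag (a hypothetical simplification; the real Mutator in ScanBatch.java also picks up changes reported through a container callback):

    ```java
    private boolean schemaChanged = false;

    // Hypothetical hook, invoked from addField() for all three cases above: a new
    // top-level field, a new nested field, or a type change on an existing field.
    private void markSchemaChanged() {
      schemaChanged = true;
    }

    public boolean isNewSchema() {
      // Report changes accumulated since the previous call, then reset, so a
      // reader that adds no columns (e.g., an empty JSON file) reports false.
      boolean result = schemaChanged;
      schemaChanged = false;
      return result;
    }
    ```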
    
    You may argue that a more appropriate way to represent an empty JSON file is an empty schema. However, that idea would lead to various schema conflicts in downstream operators if another scan thread has non-empty JSON files. This is exactly what happened before this patch.
    
    The proposal in this patch is to **ignore** an empty JSON file, since 1) rowCount = 0, and 2) no new columns were added to the batch. Concretely (see the sketch after this list):
     - If all the record readers for a scan thread return rowCount = 0 and produce no new schema, then this scan thread should return NONE directly, without first returning OK_NEW_SCHEMA.
     - If at least one reader returns either > 0 rows or a new schema, then the scan thread will return a batch with the new schema.
     - If all scan threads return NONE directly, implying the entire table has no data/schema, Project.handleFastNone() will deal with that case.
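
    Condensed into code, those rules look roughly like this (a sketch, not the literal patch: allocation and error handling are omitted, and advanceToNextReader()/setupNewSchema() are hypothetical helper names):

    ```java
    @Override
    public IterOutcome next() {
      while (true) {
        recordCount = currentReader.next();
        boolean newSchema = mutator.isNewSchema();
        if (recordCount == 0 && !newSchema) {
          if (!readers.hasNext()) {
            done = true;               // every remaining source was empty
            return IterOutcome.NONE;   // NONE without ever sending OK_NEW_SCHEMA
          }
          advanceToNextReader();       // close the empty reader, open the next one
          continue;                    // i.e., ignore the empty JSON file
        }
        if (newSchema) {
          setupNewSchema();            // rebuild the output container
          return IterOutcome.OK_NEW_SCHEMA;
        }
        return IterOutcome.OK;         // rows arrived under the existing schema
      }
    }
    ```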
    
    > But, if the input is CSV, then we always have a schema. If the file has column headers, then we know that the schema is, say, (a, b, c) because those are the headers. Or, if the file has no headers, the schema is always the columns array. So, should we send that schema downstream? If so, should it include the implicit columns?
    
    If CSV always adds columns (either _a, b, c_ or _columns_), then ScanBatch will produce a batch with (a, b, c) or with columns. It does not make sense to ignore those schemas.
      - In the case of files with headers, a file with _a,b,c_ will lead to a batch with (a,b,c), while a file with _a,b,c,d_ will lead to a batch with (a,b,c,d). Those two files will cause a schema change, which is the expected behavior.
      - In the case of files without headers, every file will produce a batch with columns, which means there is no schema change across files, regardless of whether they have row = 0 or row > 0.